
IEEE SIGNAL PROCESSING MAGAZINE
SCIENTIFIC INTEGRITY
Volume 39 | Number 6 | November 2022


Get Published in the New IEEE Open Journal of Signal Processing

Submit a paper today to the premier new open access journal in signal processing.

In keeping with IEEE's continued commitment to providing options supporting the needs of all authors, in 2020, IEEE introduced the high-quality publication, the IEEE Open Journal of Signal Processing.
In recognition of author funding difficulties during this unprecedented time, the IEEE Signal Processing Society, for 2022, is offering a reduced APC of USD$995 with no page limits. (This offer cannot be combined with any other discounts.)
We invite you to have your article peer-reviewed and published in the new journal. This is an exciting opportunity for your research to benefit from the high visibility and interest the journal will generate. Your research will also be exposed to 5 million unique monthly users of the IEEE Xplore® Digital Library.
The high-quality IEEE Open Journal of Signal Processing will draw on IEEE's expert technical community's continued commitment to publishing the most highly cited content. The editor-in-chief is the distinguished Prof. Brendt Wohlberg, who specializes in signal and image processing, inverse problems, and computational imaging.
The rapid peer-review process targets a publication time frame of 10–15 weeks for most accepted papers. This journal is fully open and compliant with funder mandates, including Plan S.

Submit your paper today!

The high-quality IEEE Open Journal of Signal Processing launched in IEEE Xplore® in January 2020 and welcomes submissions of novel technical contributions.

Digital Object Identifier 10.1109/MSP.2022.3211373


Contents | Volume 39 | Number 6 | November 2022

FEATURES
18 Rethinking Bayesian Learning for Data Analysis
Lei Cheng, Feng Yin, Sergios Theodoridis, Sotirios Chatzis, and Tsung-Hui Chang
53 Radio Map Estimation
Daniel Romero and Seung-Jun Kim

COLUMNS
6 Special Reports: Signal Processing at the Epicenter of Ground-Shaking Research
John Edwards
10 SP Everywhere: Artistic Text Style Transfer
Xinhao Wang, Shuai Yang, Wenjing Wang, and Jiaying Liu
73 SP Forum: Scientific Integrity and Misconduct in Publications
Luigi Longobardi, Tony VenGraitis, and Christian Jutten
76 Tips & Tricks: Fast and Accurate Linear Fitting for an Incompletely Sampled Gaussian Function With a Long Tail
Kai Wu, J. Andrew Zhang, and Y. Jay Guo
85 Applications Corner: The SONICOM Project: Artificial Intelligence-Driven Immersive Audio, From Personalization to Modeling
Lorenzo Picinali, Brian FG Katz, Michele Geronazzo, Piotr Majdak, Arcadio Reyes-Lecuona, and Alessandro Vinciarelli

ON THE COVER
A variety of topics are addressed in this issue, including scientific integrity, rethinking Bayesian learning, and radio map estimation.
COVER IMAGE: ©SHUTTERSTOCK.COM/KIRASOLLY

IEEE SIGNAL PROCESSING MAGAZINE (ISSN 1053-5888) (ISPREG) is published bimonthly by the Institute of Electrical and Electronics Engineers, Inc., 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA (+1 212 419 7900). Responsibility for the contents rests upon the authors and not the IEEE, the Society, or its members. Annual member subscriptions are included in the Society fee. Nonmember subscriptions are available upon request. Individual copies: IEEE members US$20.00 (first copy only), nonmembers US$248 per copy. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. Copyright Law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA; 2) pre-1978 articles without fee. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For all other copying, reprint, or republication permission, write to IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854 USA. Copyright © 2022 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY, and at additional mailing offices. Postmaster: Send address changes to IEEE Signal Processing Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854 USA. Canadian GST #125634188. Printed in the U.S.A.

Digital Object Identifier 10.1109/MSP.2022.3198296

DEPARTMENTS
3 From the Editor
Scientific Integrity: A Duty for Researchers
Christian Jutten
4 President's Message
Starting the Ethics Discussion in Our Community
Athina Petropulu
3 Dates Ahead

COVER 3
©SHUTTERSTOCK.COM/G/SUNSINGER
The International Symposium on Biomedical Imaging will be held in Cartagena de Indias, Colombia, 18–21 April 2023.

EDITOR-IN-CHIEF
Christian Jutten—Université Grenoble Alpes, France

AREA EDITORS
Feature Articles: Laure Blanc-Féraud—Université Côte d'Azur, France
Special Issues: Xiaoxiang Zhu—German Aerospace Center, Germany
Columns and Forum: Rodrigo Capobianco Guido—São Paulo State University (UNESP), Brazil; H. Vicky Zhao—Tsinghua University, R.P. China
e-Newsletter: Behnaz Ghoraani—Florida Atlantic University, USA
Social Media and Outreach: Emil Björnson—KTH Royal Institute of Technology, Sweden

EDITORIAL BOARD
Massoud Babaie-Zadeh—Sharif University of Technology, Iran; Waheed U. Bajwa—Rutgers University, USA; Caroline Chaux—French Center of National Research, France; Mark Coates—McGill University, Canada; Laura Cottatellucci—Friedrich-Alexander University of Erlangen-Nuremberg, Germany; Davide Dardari—University of Bologna, Italy; Mario Figueiredo—Instituto Superior Técnico, University of Lisbon, Portugal; Sharon Gannot—Bar-Ilan University, Israel; Yifan Gong—Microsoft Corporation, USA; Rémi Gribonval—Inria Lyon, France; Joseph Guerci—Information Systems Laboratories, Inc., USA; Ian Jermyn—Durham University, U.K.; Ulugbek S. Kamilov—Washington University, USA; Patrick Le Callet—University of Nantes, France; Sanghoon Lee—Yonsei University, Korea; Danilo Mandic—Imperial College London, U.K.; Michalis Matthaiou—Queen's University Belfast, U.K.; Phillip A. Regalia—U.S. National Science Foundation, USA; Gaël Richard—Télécom Paris, Institut Polytechnique de Paris, France; Reza Sameni—Emory University, USA; Ervin Sejdic—University of Pittsburgh, USA; Dimitri Van De Ville—Ecole Polytechnique Fédérale de Lausanne, Switzerland; Henk Wymeersch—Chalmers University of Technology, Sweden

ASSOCIATE EDITORS—COLUMNS AND FORUM
Ulisses Braga-Neto—Texas A&M University, USA; Cagatay Candan—Middle East Technical University, Turkey; Wei Hu—Peking University, China; Andres Kwasinski—Rochester Institute of Technology, USA; Xingyu Li—University of Alberta, Edmonton, Alberta, Canada; Xin Liao—Hunan University, China; Piya Pal—University of California San Diego, USA; Hemant Patil—Dhirubhai Ambani Institute of Information and Communication Technology, India; Christian Ritz—University of Wollongong, Australia

ASSOCIATE EDITORS—e-NEWSLETTER
Abhishek Appaji—College of Engineering, India; Subhro Das—MIT-IBM Watson AI Lab, IBM Research, USA; Anubha Gupta—Indraprastha Institute of Information Technology Delhi (IIIT), India; Hamid Palangi—Microsoft Research Lab (AI), USA

IEEE SIGNAL PROCESSING SOCIETY
Athina Petropulu—President; Min Wu—President-Elect; Ana Isabel Pérez-Neira—Vice President, Conferences; Shrikanth Narayanan—VP Education; K.V.S. Hari—Vice President, Membership; Marc Moonen—Vice President, Publications; Alle-Jan van der Veen—Vice President, Technical Directions

IEEE SIGNAL PROCESSING SOCIETY STAFF
William Colacchio—Senior Manager, Publications and Education Strategy and Services; Rebecca Wollman—Publications Administrator

IEEE PERIODICALS MAGAZINES DEPARTMENT
Sharon Turk, Journals Production Manager; Katie Sullivan, Senior Manager, Journals Production; Janet Dudar, Senior Art Director; Gail A. Schnitzer, Associate Art Director; Theresa L. Smith, Production Coordinator; Mark David, Director, Business Development - Media & Advertising; Felicia Spagnoli, Advertising Production Manager; Peter M. Tuohy, Production Director; Kevin Lisankie, Editorial Services Director; Dawn M. Melley, Senior Director, Publishing Operations

Digital Object Identifier 10.1109/MSP.2022.3198297

SCOPE: IEEE Signal Processing Magazine publishes tutorial-style articles on signal processing research and applications as well as columns and forums on issues of interest. Its coverage ranges from fundamental principles to practical implementation, reflecting the multidimensional facets of interests and concerns of the community. Its mission is to bring up-to-date, emerging, and active technical developments, issues, and events to the research, educational, and professional communities. It is also the main Society communication platform addressing important issues concerning all members.

IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.



FROM THE EDITOR
Christian Jutten | Editor-in-Chief | [email protected]

Digital Object Identifier 10.1109/MSP.2022.3198298
Date of current version: 27 October 2022

Scientific Integrity: A Duty for Researchers

Ethics in science is essential for various reasons and is a duty for scientists. The full sense of the word ethics may differ across languages and countries. For instance, in France, we typically make a distinction between ethics and scientific integrity, while scientific integrity is a part of ethics in the United States. For instance, the article "The Submerged Part of the AI-Ceberg" [1], published in the September issue of IEEE Signal Processing Magazine (SPM), discusses the ethical issues concerning research in our domain, i.e., a set of philosophical reflections on the interest and usefulness of our work for humanity and for the Earth. These questions are complex and must prompt debates between scientists and society. Conversely, scientific integrity is a set of good practices in the sciences, which is not debated but is strictly applied. All of the learned scientific societies promote integrity. Of course, IEEE, as such and also as a publisher of many scientific journals, strongly supports scientific integrity. Scientific integrity and misconduct are at the core of a few talks [2] at the annual IEEE Panels of Editors meeting, which brings together volunteers and staff members to discuss assorted publication-related topics. Serious failures related to scientific integrity include plagiarism and data falsification or fabrication, but misconduct related to authorship and during the review process is also unacceptable and must be met with punishment.
In my view, scientific integrity is essential and a duty for all scientists for the following reasons:
■ We have a duty to society, especially since most of us are funded by public money, to conduct our research in a manner that instills confidence and does not instead erode it.
■ Many scientists are involved in the training and supervision of young researchers. It is thus our duty to inculcate in them sound research practices that are based on scientific integrity.
IEEE's rules for good practices in publications, reviews, and so on are detailed in the IEEE Publication Services and Products Board Operations Manual [3]. For a simple, fast, and comprehensive overview of this manual, I posed a few questions to Luigi Longobardi, director of publishing ethics and conduct at IEEE, and Tony VenGraitis, program manager of publication ethics for IEEE publications; their answers are detailed in "Scientific Integrity and Misconduct in Publications" in this month's issue of SPM on pages 73–75.
Publishing-related misconduct has also been discussed in recent meetings of the SPS Publication Board. Editors-in-chief are tasked with confronting misconduct and applying the appropriate corrective measure in response to it. A preventive step is the "custom questions" that must be completed in ScholarOne when authors submit an article. We also discussed corrective actions, including a few new ones, such as sending a letter informing the employer of a scientist determined to be guilty of misconduct.

In this issue
Readers will again find a wide range of interesting articles, with two feature articles and a collection of five column articles. The first feature article, "Rethinking Bayesian Learning for Data Analysis," proposes a comprehensive tutorial on the Bayesian approach with priors promoting sparsity, which is illustrated on three typical tools used in data science: deep neural networks, tensor decomposition, and Gaussian processes. The second feature article, "Radio Map Estimation," is a tutorial on radio maps, which provide a radio-frequency spectrum landscape, with many applications in wireless communications and networking.
Among the columns, you will find a "Tips & Tricks," which proposes an efficient method for estimating an incompletely sampled Gaussian function; an "SP Everywhere," which presents a review of methods for producing text with artistic effects; and a presentation of some of the results from the European project SONICOM, which focuses on personalized immersive audio. Finally, John Edwards, as he has done for almost every SPM issue since 2011, offers an article in the "Special Reports" category. The article, "Signal Processing (continued on page 84)


PRESIDENT'S MESSAGE
Athina Petropulu | IEEE Signal Processing Society President | [email protected]

Digital Object Identifier 10.1109/MSP.2022.3198299
Date of current version: 27 October 2022

Starting the Ethics Discussion in Our Community

IEEE members are bound by the IEEE Code of Ethics [1]. By becoming IEEE members, we commit ourselves to the highest ethical and professional conduct. We agree to uphold the practices of ethical design and sustainable development; protect the privacy of others; and promptly disclose factors that might endanger the public or the environment. We strive to improve the public understanding of the implications of technology. We also seek to avoid conflicts of interest; accept honest criticism of technical work; and treat all persons fairly independent of race, religion, gender, disability, age, national origin, sexual orientation, gender identity, or gender expression.
How many have actually read the IEEE Code of Ethics? Further, is the existence of a code of ethics sufficient to avoid negative impacts to our lives from the technologies we develop?
A recurring story nowadays is that a technology is introduced; its popularity grows fast; it reaches a point where it has permeated our lives; and then, society realizes what kinds of problems that technology may create. When dealing with societal impact as an afterthought, it is often too late to introduce meaningful fixes.
Starting from school, we have learned to focus on the technical part of an idea and not on the problems the idea can lead to down the road. The technical details are sufficient to publish an article or file for a patent and, ultimately, deploy a product. It is only recently that, in engineering schools located in the United States, we started looking into ethical constraints, i.e., the impact of a technology on the environment in terms of waste produced; the power required for development or for operation and the safety of users; and, in general, the impact on the job market and society. In the United States, these considerations are studied in the senior year of college, during the capstone design project, when it may be too late for students to appreciate the importance of those issues.
As researchers and practitioners, we continue on the same path. Most do recognize the importance of ethics and ethical behavior, but when we publish or put out a product, we do not really think about whether we may be in violation of ethics principles. Of course, doing the check is not always straightforward.
For a business, where the ultimate success is financial profit, the requirement to uphold ethical principles is often met with skepticism. Will the ethical product be measured in the framework companies already use to account for value? How much does "ethics" add to the cost of doing business? Can companies that are just starting up invest in ethical practices, or is this only feasible for already established names?
When a technology promises obvious benefits, it may be easy to overlook potential harms to society and the environment. Take, for example, artificial intelligence (AI), offering tremendous opportunities by enabling human-like abilities in machines. Biasing AI training data or algorithms can lead to ethics violations. Google's image recognition system wrongly classified images of minorities [2]; the Apple Card, administered by Goldman Sachs, has come under recent scrutiny for gender bias [3]; and software used to sentence criminals was shown to be biased against minorities [2], [4]. Deepfakes, enabled by machine learning (ML) and AI, can generate deceptive visual and audio content.
Further, AI and ML have a huge carbon footprint. Training AI models requires a lot of data, which need to be stored in data centers, and this takes a lot of energy. To train the final version of Megatron-LM (a neural network for natural language applications), Nvidia ran 512 V100 GPUs over nine days, consuming 27,648 kWh, the consumption of three average U.S. households in a year [5]. And this is just one algorithm. Data centers use huge amounts of energy, consuming 10–50 times the energy per floor space of a typical commercial office building.
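(A quick back-of-the-envelope check, under our own assumption of a nominal 250-W draw per V100, a figure not given in the column: 512 GPUs × 9 days × 24 h/day × 0.25 kW = 27,648 kWh, which matches the quoted total exactly.)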
Several professional organizations (for example, NeurIPS [6] and CVPR [7]) have developed ethical guidelines for authors, encouraging them to consider not only the potential benefits of their research but also the potential negative societal impacts and to adopt measures to mitigate risk. In the IEEE Signal Processing Society (SPS), we have started the ethics discussion within our community. We have established a team of volunteers representing various SPS technical committees, and we are developing recommendations for researchers and practitioners to ensure that ethical issues are addressed during the conception of the research or product, when there is a better chance to mitigate problems.
I want to believe that going forward, as people become more aware of the impact of technology on our lives and the environment, the ability to deliver technology that is trustworthy and meets ethical constraints will gain in importance. Then, organizations that have not incorporated ethics in their practices will become less competitive.
If you read this article and are interested in contributing to the SPS ethics discussion, please feel free to contact me. We always welcome your input and suggestions.

References
[1] "IEEE Code of Ethics." IEEE. Accessed: Sep. 8, 2022. [Online]. Available: https://www.ieee.org/about/corporate/governance/p7-8.html
[2] J. Vincent. "Google 'fixed' its racist algorithm by removing gorillas from its image-labeling tech." The Verge. Accessed: Sep. 26, 2022. [Online]. Available: https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai
[3] "Apple Card under scrutiny for alleged gender bias in algorithms." Local3News.com. Accessed: Sep. 26, 2022. [Online]. Available: https://www.local3news.com/apple-card-under-scrutiny-for-alleged-gender-bias-in-algorithms/article_7b56b4f5-e005-545e-adf6-db79ff51af89.html
[4] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, "Machine bias," ProPublica, New York, NY, USA, 2016. [Online]. Available: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[5] M. Labbe. "Energy consumption of AI poses environmental problems." TechTarget. Accessed: Sep. 26, 2022. [Online]. Available: https://www.techtarget.com/searchenterpriseai/feature/Energy-consumption-of-AI-poses-environmental-problems
[6] "Ethical review guidelines," in Proc. 36th Conf. Neural Inf. Process. Syst., 2022. [Online]. Available: https://nips.cc/public/EthicsGuidelines
[7] "Ethics guidelines." CVPR. Accessed: Sep. 26, 2022. [Online]. Available: https://cvpr2022.thecvf.com/ethics-guidelines
SP

Interested in learning about upcoming SPM issues or open calls for papers? Want to know what the SPM community is up to? Follow us on Twitter (@IEEEspm) and/or join our LinkedIn group (www.linkedin.com/groups/8277416/) to stay in the know and share your ideas!



SPECIAL REPORTS
John Edwards

Signal Processing at the Epicenter of Ground-Shaking Research

Researchers turn to signal processing to minimize earthquake damage, rescue victims, and perhaps even provide advance warnings

Digital Object Identifier 10.1109/MSP.2022.3182932
Date of current version: 27 October 2022

Earthquakes have afflicted people throughout history. Today, thanks to advanced technology, more is known about earthquakes, and more can be done to protect people against them. Signal processing is playing a key role as investigators examine ways to combat one of humanity's most deadly foes.

A new view
A project at the University of Nevada at Reno is changing how researchers view potential earthquake damage. Using data collected by one of the densest seismic arrays ever deployed, researchers revealed that earthquakes emit their strongest seismic shockwaves in four opposing directions.
Daniel Trugman, an assistant professor in the university's Nevada Seismological Laboratory, reports that the study focused on what's called the earthquake radiation pattern, a technical term for the spatial pattern of ground motion. The effect, which creates a pattern resembling a four-leaf clover, has been widely known and observed for many years but never before measured in such precise detail (Figure 1).
It turns out that, when an earthquake occurs, the shaking amplitude is not the same in all directions. In reality, the shaking amplitude can vary systematically depending on the direction from the earthquake source to the seismometer recording it, or to the person experiencing it.
Trugman, working in collaboration with Brown University geophysicist Victor Tsai, was able to access one of the densest observational arrays of research seismometers ever created. The result was the most highly detailed measurements of the spatial pattern of ground motions in different frequency bands.
The analysis was based on measurements of two dozen small earthquakes recorded by the Large-n Seismic Survey in Oklahoma (LASSO), an array of 1,829 seismic sensors deployed for 28 days in 2016 to monitor a remote corner of the state covering 15 by 20 mi.
The array helped the researchers understand the physics behind the observed radiation pattern as well as the limitations of current physical models. "This study is definitely not the end of the story, but just a starting point for a long-term project," Trugman says. "The idea dates all the way back to my undergraduate honors thesis at Stanford on a tangentially related topic, but started in earnest once I began a serendipitous collaboration with Victor Tsai, who was interested in the same problem."
Trugman explains that understanding the spatial patterns of ground motion in different frequency bands is necessary to develop adequate building design codes for structures positioned near active faults. Current seismic hazard models assume a uniform ground motion prediction model, entirely ignoring the radiation pattern. The researchers are diving into the physics of the problem. Specifically, they're looking for what might cause the observed radiation pattern to differ from classic theoretical earthquake models.
When earthquakes strike, they release a sudden burst of seismic energy at multiple frequencies, but the actual ground shaking people feel ranges from about 1 to 20 Hz. The study revealed that low-frequency energy, approximately 1–10 Hz, is emitted from the fault in four directions. This observation is critical since buildings are highly vulnerable to low-frequency waves. The four-leaf clover pattern wasn't found for higher-frequency waves, which traveled at equal strength in all directions.
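For reference, this is the textbook double-couple picture (our gloss, not part of Edwards' report): for a vertical strike-slip fault with strike angle φf, the classic source model predicts a far-field P-wave amplitude at azimuth φ of roughly A(φ) ∝ |sin 2(φ − φf)|, which vanishes in four directions and peaks in four others, tracing exactly the four-lobed clover the LASSO array resolved at low frequencies.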
The study's focus was to examine how the radiation pattern varied as a function of frequency.



"To do this, we bandpass filtered the data into a range of overlapping frequency bands spanning the signal of interest and measured the ground motion amplitude of the earthquake waveforms in each frequency band," Trugman says. "Because we have measurements at each station—for each earthquake and each frequency—we can then examine radiation patterns as a function of frequency."
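The column prints no code, but the step Trugman describes is easy to sketch. Below is a minimal illustration, ours rather than the study's actual pipeline, of splitting a seismogram into the overlapping bands listed in Figure 1 using zero-phase Butterworth filters and measuring a peak amplitude per band; the amplitude metric and filter order are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Overlapping frequency bands from Figure 1 (Hz).
BANDS = [(2.5, 5.5), (4, 8), (6, 12), (8, 16),
         (12, 20), (16, 24), (20, 30), (25, 35)]

def band_amplitudes(waveform, fs, bands=BANDS, order=4):
    """Bandpass a seismogram into each band and return its peak
    ground-motion amplitude (a stand-in for the study's metric)."""
    amps = {}
    for lo, hi in bands:
        sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, waveform)  # zero-phase filtering
        amps[(lo, hi)] = np.max(np.abs(filtered))
    return amps

# Example with a synthetic 60-s record sampled at 100 Hz.
fs = 100.0
trace = np.random.default_rng(0).standard_normal(int(60 * fs))
print(band_amplitudes(trace, fs))
```

Repeating this per station and per event yields the amplitude-versus-azimuth samples from which the pattern in each band can be mapped.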
Trugman notes that the approach was relatively simple. "Easy enough to understand and implement, but effective in solving the scientific problem we were after," he says. "Because of this, there wasn't really a need to try something more fancy."
The researchers developed unique algorithms to filter the LASSO data to test the possibility of uneven shaking near faults. The algorithms confirmed the concept. "At low frequencies, each earthquake exhibited a four-leaf clover pattern; at higher frequencies, there was no clear pattern, just as Tsai had anticipated," Trugman says.
The researchers recently received funding from the National Science Foundation to pursue their research in greater detail. The upcoming work will concentrate on understanding the radiation patterns created by the more damaging earthquake S waves, as opposed to the P waves that were the focus of the initial project. "We hope this research will both improve our understanding of shaking hazards and also the more fundamental basic physics of earthquake rupture," Trugman says.
What's important to understand is that this work is something of a pilot study, with some really intriguing observations, Trugman notes. "More work is needed to really understand the physics of the problem and how they may connect to hazards."

Animal talk
No human has ever been able to reliably predict when or where an earthquake may occur. That's why researchers at the Max Planck Institute of Animal Behavior and the Cluster of Excellence Center for the Advanced Study of Collective Behavior at the University of Konstanz are focusing their attention on animals.
Numerous eyewitnesses over many centuries have reported animals behaving unusually shortly before an earthquake. Inspired by such reports, the researchers set out to investigate whether cows, sheep, and dogs can actually detect early signs of earthquakes. To do so, they attached sensors to the animals in earthquake-prone areas. Martin Wikelski, director at the Max Planck Institute of Animal Behavior and principal investigator at the Center for the Advanced Study of Collective Behavior, says that the research became possible only after the biologging revolution arrived, enabling researchers to continuously observe the location and behavior of animals.
In search of proof of unusual animal behavior prior to earthquakes, the researchers traveled to several active volcanos, ultimately selecting Mount Etna due to its generally predictable eruptions and the fact that side channels are known to approach the surface at the slopes of the volcano (Figure 2). "Here, we found evidence that a collective of farm animals forecast eruptions by several hours," Wikelski says. "This is when we patented the forecast system." The patent is titled "Disaster Alert Mediation Using Nature."

FIGURE 1. The four-leaf clover radiation pattern, comparing observations with data, in eight frequency bands (2.5–5.5, 4–8, 6–12, 8–16, 12–20, 16–24, 20–30, and 25–35 Hz) with double-couple percentages of 93.9%, 93.6%, 92.5%, 90%, 84.9%, 72.7%, 59.2%, and 48.3%, respectively. (Source: Daniel Trugman; used with permission.)



The subject animals were equipped with Bosch accelerometers, the same type used with vehicle airbags. Wireless communication technologies included very-high-frequency, ultrahigh-frequency, and International Cooperation for Animal Research Using Space (ICARUS) satellite links. "However, the ICARUS project is now on hold due to the war in Ukraine because it ended our bilateral German–Russian collaboration," Wikelski says. "We are currently establishing new receivers in orbit to get back to this global facility."
"Signal processing is essential in our project, but we do it in two ways," Wikelski says. "A priori time series comparisons, which need a lot of data and processing power, or biological information analysis using thresholds of animal behavioral reactions."
Wikelski reports that the project has been hampered by a general disbelief among engineers that animals can contribute to earthquake forecasting. "Therefore, it is still an uphill battle to receive funding for extensive tests that are necessary to show under which circumstances animal information can be helpful as additional information to existing seismological information," he says. "We have only barely begun with testing, and, therefore, this is an open-end and only expanding study."

FIGURE 2. Researchers Uschi Müller, Martin Wikelski, and Georg Heine (from left to right) observing tagged goats in their natural habitat at Mt. Etna, Sicily. (Source: Christian Ziegler; used with permission.)

Robo roaches
A research team led by Hirotaka Sato, an associate professor who holds the Provost's Chair in Mechanical and Aerospace Engineering at Singapore's Nanyang Technological University (NTU), has successfully developed an insect–computer hybrid system that's designed to help recovery teams locate earthquake victims.
The system, which uses a live cockroach as a platform (Figure 3), is designed to locomote autonomously within an obstructed area, such as a collapsed building, without any operator assistance. It can detect a human presence with its onboard camera and wirelessly report data to a remote console.
Existing minirobots currently used in earthquake survivor recovery are burdened by high power consumption (more than 100 mW), most of which is dedicated to locomotion, limiting operating time to just a few minutes. "This is because only a very small battery can be mounted on the robot, and the battery capacity is limited," Sato explains. "On the other hand, the power needed for an insect hybrid robot to locomote is just 0.1 mW or much less." The remaining power from the onboard battery can be used by other critical components, such as a radio, camera, and sensors.
A decade ago, during the early stages of their insect hybrid robot development, the researchers required only a radio and a microprocessing unit (MPU) for remote locomotion control. "Now, a decade later . . . we know that the hybrid robot can be useful for rescue missions, so we integrated sensors and localization devices needed for search and rescue missions," Sato says.

FIGURE 3. Researchers at Singapore's NTU holding insect hybrid robots. (Source: NTU Singapore; used with permission.)



To create the current insect hybrid robot, the team implants metal electrodes into the neurons and muscles of a Madagascar hissing cockroach via microsurgery. An external electrode is connected to the MPU. "We study the insect's physiology in detail, where we find out which neurons move which muscles and how the insect achieves its natural locomotion," Sato says. "We then transmit tiny electrical impulses to trigger these locomotion actions so we can nudge the insect to go where we want it to."
A components board is mounted on the insect's back. "We call it a backpack," Sato says. The only information the backpack needs to know is the position of the hybrid robot itself and the destination. The backpack automatically stimulates the insect based on its physical position in relation to its destination. "In other words, when the hybrid robot deviates from its route to the destination, the backpack stimulates the insect to return it back to the route, back toward its intended destination," he explains.
The backpack also contains an inertial measurement unit (IMU) to monitor reactions. "If the insect is blocked by an obstacle, the velocity and/or angular velocity will be low; thus, the backpack will temporarily shut off the stimulator so that the insect moves freely and finds a way around the obstruction," Sato explains. "The stimulator reactivates after [a] defined time and restarts the navigation to the intended destination."
All signal processing is handled within the backpack. "It's not good if every hybrid robot transfers data to the remote station, as it needs a high polling rate and consumes much power," Sato says. "So, we use an algorithm designed by our team to have autonomy." In fact, each hybrid robot is fully autonomous. "No remote operator is needed for navigation," Sato adds.
Victim detection is also handled within the backpack. A machine learning model classifies human or nonhuman targets based on the infrared (IR) images acquired by the onboard IR camera. "We are able to achieve 87% accuracy of human detection from a single image alone, but most likely the camera will be taking multiple images at intervals when heat is detected, which means the accuracy will be beyond 99%," Sato states.
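(One way to make the 87%-to-99% jump concrete, under our own simplifying assumptions rather than NTU's actual fusion rule: if each IR frame independently detects a present human with probability 0.87, and a person is declared present when any frame fires, then three frames already give 1 − (1 − 0.87)³ = 1 − 0.13³ ≈ 0.998, beyond 99%. A stricter majority-vote rule would need more frames.)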

The IMU data are collected at a sampling rate of 100 Hz. The signal is then processed by a digital motion processor embedded inside the IMU, using 6-degrees-of-freedom Kalman filtering to obtain the orientation and global acceleration. "Acceleration is then low-pass filtered and baseline filtered before being used for the navigation process," Sato explains. "These methods are simple yet effective to handle the input data of the IMU to meet our expected requirements." He notes that the methods are also compatible with an onboard MCU, with only minimal computational power required.
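As a rough sketch of the acceleration cleanup Sato describes (ours; the onboard filter parameters are not given in the article, so the cutoff, window, and threshold below are placeholders), a low-pass filter suppresses high-frequency gait noise and a moving-average baseline filter removes drift before a velocity check of the kind used for obstacle detection:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import butter, sosfiltfilt

FS = 100.0  # IMU sampling rate reported in the article (Hz)

def lowpass(accel, cutoff_hz=5.0, order=2):
    """Low-pass the global-frame acceleration (cutoff is our guess)."""
    sos = butter(order, cutoff_hz, btype="lowpass", fs=FS, output="sos")
    return sosfiltfilt(sos, accel, axis=0)

def remove_baseline(accel, window_s=1.0):
    """Subtract a moving-average baseline to cancel slow drift."""
    baseline = uniform_filter1d(accel, size=int(window_s * FS), axis=0)
    return accel - baseline

def likely_blocked(accel, threshold=0.05):
    """Crude stand-in for the obstacle test: integrate acceleration
    to a velocity estimate and flag when its magnitude stays low."""
    vel = np.cumsum(accel, axis=0) / FS
    return float(np.linalg.norm(vel[-int(FS):], axis=1).mean()) < threshold

# accel: (n_samples, 3) global acceleration out of the IMU's 6-DoF
# Kalman filter; random data stands in for a real recording here.
accel = np.random.default_rng(1).standard_normal((1000, 3)) * 0.01
clean = remove_baseline(lowpass(accel))
print("blocked?", likely_blocked(clean))
```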

The final filtering for acceleration was carefully designed to achieve both performance and calculation efficiency, Sato says. He adds that the team still needs to improve its navigation algorithm to better account for the insect's native instincts and reactions. "We also need to optimize the backpack design and make it smaller and less obtrusive, so we can minimize any hindrance or effect on the insect's locomotion."
Sato states that the entire team takes its research very seriously. "We hope to use this technology to save people in disaster areas," he explains. "Our ultimate goal is to reduce the fear of disaster and increase the chances of survival."

Author
John Edwards ([email protected]) is a technology writer based in Gilbert, Arizona, 85234, USA. Follow him on Twitter @TechJohnEdwards.
SP


SP EVERYWHERE
Xinhao Wang, Shuai Yang, Wenjing Wang, and Jiaying Liu

Artistic Text Style Transfer

An overview of state-of-the-art methods and datasets

Digital Object Identifier 10.1109/MSP.2022.3196763
Date of current version: 27 October 2022

Word art, which is text rendered with properly designed, appealing artistic effects, has been a popular form of art throughout human history. Artistic text effects are of great aesthetic value and symbolic significance. Decorating with appropriate effects not only makes text more attractive but also significantly enhances the atmosphere of a scene. Thus, artistic text effects are widely used in publicity and advertising. Some text effects are simple, such as colors and shadows, while some can be complex, such as the burning flames in Figure 1 and the exquisite decorations in Figure 2. Manually creating vivid text effects requires lots of time and a series of complicated operations: observing the target glyphs, designing appropriate artistic effects, warping the texture to match the character shapes, and so on. It consumes hours of time even for well-trained designers. To produce word art more conveniently and efficiently, artistic text style transfer has been proposed recently to automatically render text with given artistic effects.
In this article, we provide a comprehensive overview of current advances in artistic text style transfer. First, we formulate the task. Second, we investigate and classify state-of-the-art methods into nondeep- and deep-based methods, introduce their core ideas, and discuss their strengths and weaknesses. Third, we present several benchmark datasets and evaluation methods. Finally, we summarize the current challenges in this field and propose possible directions for future research.

Task formulation
Artistic text style transfer aims at automatically turning plain text into fantastic artworks with given artistic effects.

FIGURE 1. An overview of application scenarios of text style transfer methods, with panels showing source text S, source effects S′, target text T, and result T′ as inputs and outputs. (a) Transfer of text effects with supervision. (b) Joint transfer of text effects and fonts without supervision. (c) Transfer of effects from arbitrary style images without supervision.



According to the input, we divide the artistic text style transfer problem into three categories, from easy to difficult: supervised effect transfer, unsupervised effect transfer, and joint font and effect transfer.
For a supervised effect transfer, as shown in Figure 1(a), the source effects S′ in addition to the corresponding nonstylized image S are required. The algorithms learn the transformation between them, then apply it to the target text T to synthesize the result T′. An unsupervised effect transfer, on the other hand, gets rid of the dependency on S and directly builds the transformation by extracting the proper features of the source effects S′ and target text T, as shown in Figure 1(b). Since S is not required, the constraints on S′ are further relaxed, and S′ can be an arbitrary style image besides text effects [Figure 1(c)]. As for joint font and effect transfer, it considers fonts as a part of the style, thus aiming at transferring artistic effects and text fonts jointly.
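In symbols (our compact paraphrase, not notation from the article): the supervised setting learns a mapping f from the pair (S, S′), i.e., S′ ≈ f(S), and outputs T′ = f(T); the unsupervised setting estimates T′ = g(T, S′) directly from features of S′ and T; and joint transfer produces T′ = h(T, {S′}), where h restyles both the glyph shape and the texture.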
Nondeep text effect transfer methods
Early research mainly regards text effects as textures. These methods adopt image patches to model artistic styles and use a technology called patch matching [1] to synthesize textures according to the text shapes. Intuitively, patch matching [1] divides the source effect image S′ and the target text image T into overlapped patches. For each patch of T, a set of best-matched patches from S′ is collected, among which one is chosen according to a certain criterion and used to generate the texture of the target patch. Since the texture and style information come directly from the source effect image, these patch-based methods are able to generate rich texture details.
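To make the mechanics concrete, here is a deliberately naive sketch of patch-based transfer (ours; the actual PatchMatch algorithm [1] replaces the brute-force search below with a fast randomized one): every target patch copies pixels from the source patch whose glyph content matches best.

```python
import numpy as np

def patch_match_transfer(style, text, target, patch=7, stride=3):
    """Naive patch-based texture transfer.

    style:  source effect image S' (H, W, 3), aligned with `text`
    text:   source text image S (H, W), grayscale glyph guidance
    target: target text image T (H2, W2), grayscale
    Returns a stylized target T': for every target patch, the style
    patch whose glyph content matches best is copied over (brute
    force; PatchMatch [1] makes this search fast and randomized).
    """
    h, w = target.shape
    out = np.zeros((h, w, 3))
    weight = np.zeros((h, w, 1))
    # Precompute all candidate source patches as flat vectors.
    coords = [(y, x) for y in range(0, text.shape[0] - patch, stride)
                     for x in range(0, text.shape[1] - patch, stride)]
    src = np.stack([text[y:y+patch, x:x+patch].ravel() for y, x in coords])
    for ty in range(0, h - patch, stride):
        for tx in range(0, w - patch, stride):
            q = target[ty:ty+patch, tx:tx+patch].ravel()
            d = ((src - q) ** 2).sum(axis=1)   # match on glyph content
            y, x = coords[int(np.argmin(d))]   # best source patch
            out[ty:ty+patch, tx:tx+patch] += style[y:y+patch, x:x+patch]
            weight[ty:ty+patch, tx:tx+patch] += 1
    return out / np.maximum(weight, 1)         # average overlapping copies
```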
In this section, we review some nondeep text effect transfer methods according to the four typical application scenarios in Figure 3: effect transfer with/without supervision, effect transfer with human interaction, and dynamic effect transfer. We first introduce each method and then discuss its strengths and weaknesses.

Supervised text effect transfer
Yang et al. [2] first raised the brand-new topic of text effect transfer and designed a nondeep algorithm, T-Effect, specialized for rendering awesome word art. Compared to general image style transfer, the authors pointed out three new challenges that text effect transfer faces: 1) text effects and text shapes are extremely diverse; 2) it is challenging to compose the glyph and style elements properly; and 3) the input text images are too simple to instruct the placement of different subeffects. As an attempt to solve these difficulties, Yang et al. initially investigated and analyzed well-designed text effects and summarized their key characteristic: a high correlation between patch patterns (e.g., color and scale) and their distances to text skeletons. Based on this clue, the patch-matching algorithm [1] is improved from two perspectives: patch preparation and the matching strategy.

FIGURE 2. Some results of different text style transfer methods. (a) User-interactive effects transfer. (b) Decorative elements effects transfer. (c) Shape-matching effects transfer. (d) Joint effects and fonts transfer.



For patch preparation, T-Effect automatically detects the optimal patch scale to depict the texture patterns around each pixel. In the process of matching, T-Effect processes effects and glyphs at the same time, takes the distance from texture patches to text skeletons into consideration, and adds a psychovisual term to avoid texture overrepetitiveness. However, this method requires a corresponding nonstylized image S, as shown in Figure 3(a), to learn the transformation between text and artistic text. Unfortunately, such a pair of inputs is often unavailable in practice, which limits its application scenarios.
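Stated loosely as a matching cost (our paraphrase; the exact objective in [2] differs in detail), a target patch centered at p is matched to a source patch centered at q by minimizing something like

E(p, q) = ||P_T(p) − P_{S′}(q)||² + λ_dist (d_T(p) − d_{S′}(q))² + λ_rep Ψ(q),

where the first term compares glyph-plus-texture appearance at the automatically detected patch scale, d(·) is the distance to the text skeleton, and Ψ(q) penalizes overusing the same source patch (the psychovisual term).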
Unsupervised text effect transfer
To tackle the aforementioned issue, Yang et al. [3] further presented a novel unsupervised algorithm, UT-Effect, to stylize the text with only a target text image T and an arbitrary source effects image S′, as in Figure 3(b). To build up the mapping relation between different modalities, UT-Effect [3] first extracts the main structural imagery of the style image with the help of a texture-removal algorithm [4] and superpixel extraction [5]. The extracted structure naturally serves as a guidance map and builds a preliminary mapping to the text. The mapping is then refined by a bidirectional legibility-preserving structure transfer algorithm, which adds shape characteristics of the source style to the text while maintaining the main structure of the text stroke. Based on the mapping, a patch-matching [1] texture transfer algorithm guided by saliency constraints is proposed, yielding satisfying results without any supervision. This method can be further extended to graphic design, where text is inserted into an image, matching the foreground and harmonizing with the background. The authors designed a new context-aware synthesis framework to determine where to place and how to embed the text, making the algorithm a powerful tool for computer-aided poster design.

Interactive effect transfer
An artistic text effect transfer can also introduce user interaction, in which the texture is transferred under user guidance. Men et al. [6] established a general framework of user-guided texture transfer for multiple tasks. To control the spatial distribution of the stylized texture, this method requires users to annotate semantic maps for both the source and target texts, as shown in Figure 3(c). Although introducing semantic maps provides a more accurate semantic mapping relationship, the texture arrangement inside each semantic area is not taken into consideration, so some existing structural textures are easily lost in the transfer process. To successfully transfer these unmarked structural textures, the authors used content-aware saliency detection to extract them and transfer them through matching key contour points. These pretransferred structural textures act as a prior in the following step. Combining semantic and structure information enables an improved patch-matching algorithm [1] to provide high-quality texture with content awareness and low-level details. As shown in Figure 2(a), this method is able to precisely control the spatial distributions of text effects with user guidance. The combined semantic and structure guidance also invests this method with great flexibility; therefore, it can further generalize to other shapes besides text.

Dynamic text effect transfer
With the development of the mobile Internet and social media, videos and graphic interchange formats, which are more vivid carriers of visual information, have become widespread. Extending static effect transfer on a single image to dynamic effect transfer on videos has great potential applications. Compared with the supervised static effect transfer in Figure 3(a), dynamic effect transfer replaces the single image S′ with a set of video frames, as shown in Figure 3(d). The result T′ correspondingly becomes a set of video frames. The first method for dynamic effects was proposed by Men et al. [7].

FIGURE 3. A brief introduction to nondeep text style transfer methods (S: source text; S′: source effects; T: target text; T′: result). (a) Supervised effects transfer. (b) Unsupervised effects transfer. (c) Supervised effects transfer with human interaction. (d) Supervised dynamic effects transfer.



To achieve temporal consistency and preserve spatial texture continuity, this algorithm simultaneously carries out patch matching across all keyframes, during which common nearest-neighbor patches are found for all frames. In addition, the authors introduced simulated annealing for deep direction-guided propagation to ensure that complicated effects can be completely synthesized. However, this method also requires the corresponding nonstylized image of the dynamic effects video, which is usually not available in practice.

Deep-based text effect transfer methods
The aforementioned nondeep algorithms are capable of faithfully transferring prescribed style effects and synthesizing appealing stylized text. They are all based on patch matching [1], whose essence is rearranging image patches. Consequently, nondeep algorithms typically have limited flexibility and a high time complexity because of the iterative patch-matching procedure. The need to address these limitations gave birth to deep-based algorithms. Instead of operating on image patches, deep-based methods automatically extract features of artistic text effects and text shapes. These high-dimensional features learned from data disentangle the characteristics of word art and are easier to adjust. Thus, by operating in the feature space, deep-based methods serve as a more flexible and powerful editing tool.
A generative adversarial network (GAN) is the basis of most deep-based text effect transfer methods. In a GAN, a subnetwork called a discriminator distinguishes real text effect images from fake ones generated by the text effect transfer network. The two networks are trained together in an adversarial manner until they achieve equilibrium. Based on their applicability, deep-based methods can be divided into two categories: multistyle-per-model and per-style-per-model. Multistyle-per-model methods are able to process multiple effects with a single trained network but usually rely on a large amount of training data. A basic framework is shown in Figure 4(a): the network takes a pair of source effects S′ and target text T as input and produces the stylized text T′. During training, a discriminator is utilized to improve the visual quality of the synthesized results. Per-style-per-model methods, on the other hand, are mainly designed for the more challenging one-shot learning task, where the network is supervised by only a single training image pair. One representative framework is demonstrated in Figure 4(b). As shown in the top part, the network uses the given sample to learn the transfer of structure and texture during training. Then, as shown in the bottom part, the learned model can be applied to the target text during testing. The limitation is that per-style-per-model methods can learn only one style in the training process.
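A skeletal PyTorch rendering of the Figure 4(a) setup may help; this is our minimal sketch, with placeholder layer sizes and losses, not the architecture of TET-GAN or any other published model. The stylization network G maps (S′, T) to T′, while the discriminator D is trained to tell ground-truth stylized text from G's output:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class Stylizer(nn.Module):
    """Takes source effects S' and target text T (concatenated on the
    channel axis) and produces the stylized text T'."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(conv_block(6, 64), conv_block(64, 128))
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, padding=1),
            nn.Tanh())
    def forward(self, s_prime, t):
        return self.dec(self.enc(torch.cat([s_prime, t], dim=1)))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 64), conv_block(64, 128),
                                 nn.Conv2d(128, 1, 3, padding=1))
    def forward(self, img):
        return self.net(img)

G, D = Stylizer(), Discriminator()
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(s_prime, t, t_true):
    # Discriminator step: real stylized text vs. generated T'.
    fake = G(s_prime, t)
    d_real, d_fake = D(t_true), D(fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: fool D and stay close to the ground-truth pair.
    d_fake = D(fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + \
             nn.functional.l1_loss(fake, t_true)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The L1 term relies on the paired ground truth that multistyle-per-model training assumes; the adversarial term is what sharpens texture beyond what a plain regression loss produces.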
In this section, we introduce several representative deep-based text effect transfer methods.

Disentangle the text and effect
One intuitive path toward text effect transfer is to characterize the glyphs and styles separately and realize flexible effect transfer by disentangling and recombining them. This idea was first utilized by Yang et al. [8], who developed an encoder–decoder-based texture effects transfer GAN (TET-GAN) to support stylization and destylization at the same time. The authors designed three tasks to simultaneously train TET-GAN: text autoencoding, stylization, and destylization. In this way, the encoders are capable of disentangling content and style features, while the decoders are able to recombine content and style features to synthesize stylized text. In addition, a self-stylization training scheme can be used for one-shot fine-tuning when facing an unseen effect. This method performs well when the spatial distribution of the texture in text effects is highly related to its distance from the text skeleton, but it still produces poor results when the network fails to recognize the glyph in style images.

Transfer the effects with decorative elements
All of the aforementioned approaches assume that the styles are uniform within or outside the text, thus failing to render the exquisite decorative elements that are commonly used in artistic text design, resulting in visual quality degradation.

FIGURE 4. The two categories of deep-based text style transfer methods (S: source text; S′: source effects; T: target text; T′: result; Enc: encoder; Dec: decoder; D: discriminator). (a) Multistyle-per-model methods. (b) Per-style-per-model methods.



To address this problem, Wang et al. [9] proposed to detect, separate, and recombine these important embellishments. First, a segmentation network is trained to detect the decorative elements. Then, based on the segmentation results, the decorative elements are separated from the basal text effects, and a text style transfer network infers the basal text effects for the target text. Finally, cues for spatial distributions and element diversities are characterized to jointly integrate the decorative elements onto the target text. Similar to [8], a one-shot fine-tuning scheme is proposed to empower the network to extend to a new style with only one example. This algorithm performs especially well when the reference style image contains elaborate decorative items. Some representative results are shown in Figure 2(b).

Controllable text effect transfer
Text is significantly different from, and more structured than, nontext images. When using arbitrary artistic images as style references, the glyph should deform to better resemble the style subject, but overdeformation degrades legibility. In other words, there is a tradeoff between glyph legibility and stylistic degree. To find a delicate balance, Yang et al. [10] proposed a controllable artistic text style transfer algorithm, Shape-Matching GAN, which supports real-time control of glyph deformations. There are two main challenges for glyph deformations: how to change the text shapes to match the reference style and how to control the deformation degree. For the former, the authors proposed a novel bidirectional shape-matching strategy, which establishes a shape mapping between the source style and target text through both backward and forward transfers. The authors first simplified the style input into a sketchy structure, named backward structure transfer. Then, models are trained to map the sketchy structure back to the original S and further to S′ to learn the forward structure and texture transfers. Discriminators are applied to enhance the generation effects.
For the latter, the authors built a scale-controllable module to empower the network to learn the style features on a continuous scale. In this way, the deformation degree can be simply controlled with a weight parameter. As a representative per-style-per-model method, the framework of Shape-Matching GAN is demonstrated in Figure 4(b). After being trained with the bidirectional shape-matching strategy, the networks can render target text with the learned effects to generate appealing results. This method requires only one style example for training and outperforms previous algorithms in deforming the shape of text to match the style image, as shown in Figure 2(c), thanks to the bidirectional shape-matching strategy. However, one limitation is that Shape-Matching GAN is a per-style-per-model method, which means it needs retraining when facing a new style.
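In effect (our notation, paraphrasing the scale-controllable module above), the trained generator becomes a function G(T, ℓ) of the target text T and a deformation weight ℓ ∈ [0, 1]: ℓ near 0 keeps the glyph close to its original contour, while ℓ near 1 lets it deform freely toward the style structure, so legibility can be traded against stylistic strength at inference time without retraining.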
Controllable dynamic text effect transfer
Shortly after the work in [10], Yang et al. extended the previous method to the dynamic text effects transfer algorithm Shape-Matching GAN++ [11]. Shape-Matching GAN++ characterizes the short-term consistency of motion patterns via shape matching within consecutive frames. By repeatedly predicting the next frame according to a few previously generated frames, it achieves effective long-term consistency. Shape-Matching GAN++ can generate appealing artistic text animation that characterizes large-scale motion patterns while preserving temporal consistency at the same time. It shares the same limitations as Shape-Matching GAN [10].

Joint font and text effect transfer methods
The above-mentioned algorithms, whether nondeep or deep based, focus mainly on transferring artistic effects (such as color distributions and texture) while keeping the font unchanged. However, the font shape of a reference word art usually contains rich aesthetic implications that should not be neglected. To capture the full style information, some subsequent works have been proposed to jointly transfer both the font and the text effects. Since it is rather difficult to extract the characteristics of a font with only one sample, these joint font and text effect transfer methods usually need several reference source effect images as input. Some of the methods are introduced and discussed in this section.

Few-shot transfer for the English alphabet
Multicontent GAN (MC-GAN) [12] is the first end-to-end solution for synthesizing ornamented glyphs. The authors divided the problem into two parts: modeling the overall glyph shape and then synthesizing the final appearance with color and texture. To enable this, they developed a stacked conditional GAN to predict the coarse glyph shapes and an ornamentation network to predict the color and texture of the final glyphs. The two networks are trained jointly and specialized for each typeface using a very small number of observations. One limitation is that MC-GAN is specially designed for processing English letters, and the input is limited to concatenated images of English letters. Besides, the ornamentation network performs in a per-style-per-model fashion, which requires retraining for every new style.

Few-shot transfer for arbitrary glyphs
Some language systems contain massive numbers of characters (e.g., Chinese, whose official character set GB18030 contains 27,533 characters), which disables character-set-specific methods like MC-GAN [12]. To overcome this limitation, Gao et al. developed the first few-shot learning algorithm, AGIS-Net [13], to transfer both shape and texture styles to arbitrarily large numbers of characters and generate high-quality synthesis results. The proposed AGIS-Net is a simple yet effective model that exploits two parallel encoder–decoder branches.
efficient local texture refinement loss, which is helpful in improving the quality of synthesis results. Example results are demonstrated in Figure 2(d). With only four reference images available, AGIS-Net successfully transfers both fonts and effects and generates appealing word art.

Few-shot transfer for glyphs and nontext objects
MC-GAN [12], AGIS-Net [13], and other related works mostly treat typeface and visual effects as two isolated attributes, thus requiring multiple subnetworks to separately transfer shape styles and texture styles. From another perspective, Li et al. [14] first proposed that the typeface should be considered with the visual effects as a whole. On this basis, Li et al. developed the simple unified framework FET-GAN [14], which contains only one encoder–decoder branch. Given a few samples in the same style for reference, FET-GAN can effectively translate the original effects of an image to the referenced effects while maintaining the global structure unchanged. A few-shot fine-tuning strategy is additionally designed for generalizing to unseen effects. Beyond its superior synthesis performance, another exclusive advantage is that FET-GAN can be generalized to nontext objects of any shape since it considers typeface and other effects as a whole.

Dataset and evaluation

Benchmark datasets
There are multiple available benchmark datasets for text style transfer, as summarized in Table 1. These datasets differ mainly in three major aspects: 1) whether font migration is taken into consideration, 2) whether the reference text style contains special elements, and 3) the kinds of character types that are involved. For the first aspect, the first four datasets, MC-GAN-GrayScale [12], MC-GAN-Color [12], AGIS-Net-C [13], and AGIS-Net-P [13], consider fonts along with text effects. For the second aspect, only TextEffects-Decor [9] collects effects with decorative elements; examples are shown in Figure 2(b). For the third aspect, TET-GAN [8], TE141K-C [15], and TE141K-S [15] contain characters of multiple languages, while others only collect either English letters or Chinese characters. Among all of the datasets, AGIS-Net-P has the widest character types, and MC-GAN-Color has the largest number of text styles.

Performance evaluation
Assessing the quality of stylized text is of great value in allowing users to efficiently search for high-quality results as well as guiding the designing of text style transfer algorithms. However, not enough attention has been paid to quality assessment.

Table 1. A summary of the benchmark datasets for the text effect transfer task.

Dataset | Type | Styles | Glyphs | Style Type | Glyph Type | Images
MC-GAN-GrayScale [12] | Joint font and effects | 10,000 | 26 | Gray-scale fonts for English letters | English letters | 260,000
MC-GAN-Color [12] | Joint font and effects | 20,000 | 26 | Colorful fonts for English letters | English letters | 520,000
AGIS-Net-C [13] | Joint font and effects | 2,460 | 639 | Synthetic artistic fonts | Chinese characters | 1,571,940
AGIS-Net-P [13] | Joint font and effects | 35 | 7,326 | Professional-designed fonts | Chinese characters | 256,410
TET-GAN [8] | Text effects | 64 | 837 | Text effects collected from websites | 775 Chinese characters, 52 English alphabets, 10 Arabic numerals | 53,568
TextEffects-Decor [9] | Text effects | 60 | 988 | Text effects with decorative elements | 52 English letters of 19 different fonts | 59,280
TE141K-E [15] | Text effects | 67 | 988 | Professional-designed text effects | 52 English letters of 19 different fonts | 66,196
TE141K-C [15] | Text effects | 65 | 837 | Professional-designed text effects | 775 Chinese characters, 52 English alphabets, 10 Arabic numerals | 54,405
TE141K-S [15] | Text effects | 20 | 1,024 | Professional-designed text effects | 56 special symbols, 968 letters in Japanese, Russian, and so on | 20,480

The evaluation of text style transfer algorithms still remains an open and important problem in this field.

In general, there are two major types of evaluation methodologies: qualitative evaluation and quantitative evaluation. The most widely used qualitative evaluation is the aesthetic judgment of observers. Most research studies evaluate different algorithms by comparing their corresponding user preference ratio. The evaluation results are related to many factors (e.g., the age, gender, and occupation of participants). Therefore, this metric is subjective, and it is difficult to reproduce a user study result.

For quantitative evaluation, there is currently no appropriate metric. TE141K [15] and other studies mainly choose widely used image quality assessment metrics as an alternative, such as the structural similarity index, peak signal-to-noise ratio (PSNR), and perceptual loss [16]. These metrics are designed for natural images and therefore cannot effectively evaluate the quality of text effect transfer results.
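As a concrete illustration of how such full-reference metrics are computed, the short sketch below evaluates PSNR between a stylized result and its ground truth with plain NumPy; the random images and the 8-bit peak value are illustrative assumptions, not part of any of the benchmarks above.

import numpy as np

def psnr(result, reference, peak=255.0):
    # Peak signal-to-noise ratio between a stylized result and its ground
    # truth; both arrays share the same shape and intensity scale, and
    # `peak` is the maximum possible pixel value (255 for 8-bit images).
    mse = np.mean((result.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                     # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)
stylized = np.clip(gt + rng.normal(0, 5, size=gt.shape), 0, 255)
print(round(psnr(stylized, gt), 2), "dB")       # higher means closer to the reference

PSNR only scores pixelwise differences, which is precisely why structural (structural similarity index) and deep-feature (perceptual loss [16]) metrics are usually reported alongside it.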
To make up for the lack of quantitative evaluation metrics in this field, Yan et al. [17] provided a solution that can automatically assess the quality of stylized glyphs, where a multitask attentive network is proposed to imitate the human visual evaluation process. The network is trained to accomplish multiple tasks of style autoencoding, content autoencoding, stylization, and destylization, through which it learns to characterize robust style features and content features. Furthermore, visual attention modules are incorporated to simulate the process of human high-level visual judgment, paying more attention to areas of interest. This model can serve as an effective tool to assist quality assessment. The limitation is that the text effects are too diverse to cover with one dataset. As a result, this method can only handle a limited number of styles and therefore lacks applicability.

Applications
Because of the visually plausible stylized results and wide application scenarios, the text style transfer research and algorithms have already led to many successful industrial applications and begun to deliver commercial benefits. In this section, we summarize some applications and propose some potential usages.

Choose from style libraries
To ensure the visual quality of generated text with various artistic effects, most existing word-art creation tools do not support user-specified styles. Instead, they construct a large style library and allow users to choose styles according to their preferences. Microsoft Office provides users with convenient operation and a large number of fonts with special effects. Adobe Spark, a more powerful tool, contains dynamic effects and more professional-designed styles. It also supports creating customized posters with stylized text.

Create with user-specified style
For a broader application scope, another kind of tool focuses on synthesizing artistic text with arbitrary user-specified styles. For fonts, FlexiFont supports constructing a user-specified Chinese typeface with several samples provided by users. For text effects, because of the difficulty in collecting datasets and the limited number of reference works, there are no popular applications that apply the technique of transferring arbitrary reference text effects. However, with the development of text effect transfer algorithms and increasing user demand, we believe that it will be a promising potential creation tool in the future. Some other fully functional tools can also serve as an alternative; e.g., Photoshop and Adobe Spark are capable of creating customized stylized text under simple user operations. Another potential usage is to assist painters and designers in creating word art, especially when working on computer-made artistic text.

Graphic design
Artistic text effect transfer can also be potentially used in graphic design. UT-Effect [3] provides an effective tool for combining stylized text and background images. Combining this method with visual–textual layout generation [18] may enable automatic poster design.

Future challenges
Although the advances in the field of text style transfer are inspiring, and the current algorithms are capable of generating satisfactory results, there are still several challenges and open issues. In this section, we summarize some key challenges and discuss possible strategies to solve them.

Evaluation methodology
Performance evaluation for stylized text is an important issue and is becoming increasingly critical as text effect transfer algorithms constantly develop. On the one hand, there is no standard benchmark dataset. Existing datasets suffer from insufficient style or character types. Moreover, different researchers typically use their own collected datasets, making it difficult to compare the performance of different text effect transfer methods. More efforts in collecting a large-scale benchmark dataset are necessary. On the other hand, as stated in the section “Performance Evaluation,” existing assessment metrics are not appropriate for text effect transfer. User studies are frequently adopted, but the result varies greatly among different observers and is hard to reproduce. More reliable evaluation criteria are needed to assess the performance of text effect transfer approaches.

Time complexity and generalization ability
The existing methods cannot achieve a high processing speed and strong generalization ability at the same time. The nondeep algorithms introduced in the section “Nondeep Text Effect Transfer Methods” have slow processing speeds caused by the iterative optimization process. The deep-based feedforward methods in the sections “Deep-Based Text Effect Transfer Methods” and “Joint Font and Text Effect Transfer Methods” all learn text effects from training data and thus
are incapable of transferring unseen styles without specialized fine-tuning schemes. To achieve both a low time complexity and strong generalization ability, perhaps a feedforward algorithm that can generalize to arbitrary styles could be designed with more flexible style transfer modules and larger datasets.

Conclusions
Over the past several years, the topic of text style transfer has gained wide attention and become an inspiring research area. Numerous works, whether optimization based or deep based, have been proposed to achieve surprising results in each subtopic. Despite the great progress and achievements in recent years, the area of text style transfer is far from maturity. A lack of reliable evaluation metrics, inconsistent benchmark datasets, and other issues still remain to be solved in future research. We believe that subsequent studies will solve these existing problems and continue to contribute to this developing research area.

Acknowledgment
This work was supported by the National Natural Science Foundation of China under Contract 62172020 and was a research achievement of Key Laboratory of Science, Technology, and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). The corresponding author is Jiaying Liu.

Authors
Xinhao Wang ([email protected]) received his B.S. degree in intelligence science at the Wangxuan Institute of Computer Technology, Peking University, Beijing, 100080 China, in 2022. His current research interests include style transfer and deep learning.

Shuai Yang ([email protected]) received his B.S. and Ph.D. degrees (Hons.) in computer science from Peking University, Beijing, China, in 2015 and 2020, respectively. He is currently a postdoctoral research fellow with the Artificial Intelligence Corporate Laboratory, Nanyang Technological University, 639798 Singapore. He was a visiting scholar with Texas A&M University from September 2018 to September 2019. He received the IEEE International Conference on Multimedia and Expo 2020 Best Paper Award and the IEEE International Workshop on Multimedia Signal Processing 2015 Top 10% Paper Award. His current research interests include image stylization and image generation. He is a Member of IEEE.

Wenjing Wang (daooshee@pku.edu.cn) received her B.S. degree in data science from Peking University, Beijing, 100080 China, in 2019, where she is currently pursuing her Ph.D. degree at the Wangxuan Institute of Computer Technology. Her current research interests include image enhancement, image synthesis, and deep learning. She is a Student Member of IEEE.

Jiaying Liu ([email protected]) received her Ph.D. degree in computer science from Peking University. She is currently an associate professor (Boya Young Fellow) with the Wangxuan Institute of Computer Technology, Peking University, Beijing, 100080 China. She has authored numerous articles and holds 60 patents. She has served on the Multimedia Systems and Applications Technical Committee and the Visual Signal Processing and Communications Technical Committee in the IEEE Circuits and Systems Society. She has been an associate editor of IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, and Journal of Visual Communication and Image Representation. Her current research interests include multimedia signal processing, compression, and computer vision. She is a Senior Member of IEEE.

References
[1] Y. Wexler, E. Shechtman, and M. Irani, “Space-time completion of video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 463–476, 2007, doi: 10.1109/TPAMI.2007.60.
[2] S. Yang, J. Liu, Z. Lian, and Z. Guo, “Awesome typography: Statistics-based text effects transfer,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7464–7473.
[3] S. Yang, J. Liu, W. Yang, and Z. Guo, “Context-aware unsupervised text stylization,” in Proc. ACM Int. Conf. Multimedia, 2018, pp. 1688–1696, doi: 10.1145/3240508.3240580.
[4] L. Xu, Q. Yan, Y. Xia, and J. Jia, “Structure extraction from texture via relative total variation,” ACM Trans. Graph., vol. 31, no. 6, pp. 1–10, 2012, doi: 10.1145/2366145.2366158.
[5] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, 2012, doi: 10.1109/TPAMI.2012.120.
[6] Y. Men, Z. Lian, Y. Tang, and J. Xiao, “A common framework for interactive texture transfer,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6353–6362.
[7] Y. Men, Z. Lian, Y. Tang, and J. Xiao, “DynTypo: Example-based dynamic text effects transfer,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5870–5879, doi: 10.1109/CVPR.2019.00602.
[8] S. Yang, J. Liu, W. Wang, and Z. Guo, “TET-GAN: Text effects transfer via stylization and destylization,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 1238–1245, doi: 10.1609/aaai.v33i01.33011238.
[9] W. Wang, J. Liu, S. Yang, and Z. Guo, “Typography with decor: Intelligent text style transfer,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5889–5897.
[10] S. Yang, Z. Wang, Z. Wang, N. Xu, J. Liu, and Z. Guo, “Controllable artistic text style transfer via shape-matching GAN,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 4442–4451.
[11] S. Yang, Z. Wang, and J. Liu, “Shape-matching GAN++: Scale controllable dynamic artistic text style transfer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3807–3820, 2021, doi: 10.1109/TPAMI.2021.3055211.
[12] S. Azadi, M. Fisher, V. G. Kim, Z. Wang, E. Shechtman, and T. Darrell, “Multi-content GAN for few-shot font style transfer,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7564–7573.
[13] Y. Gao, Y. Guo, Z. Lian, Y. Tang, and J. Xiao, “Artistic glyph image synthesis via one-stage few-shot learning,” ACM Trans. Graph., vol. 38, no. 6, pp. 1–12, 2019, doi: 10.1145/3355089.3356574.
[14] W. Li, Y. He, Y. Qi, Z. Li, and Y. Tang, “FET-GAN: Font and effect transfer via k-shot adaptive instance normalization,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 1717–1724, doi: 10.1609/aaai.v34i02.5535.
[15] S. Yang, W. Wang, and J. Liu, “TE141K: Artistic text benchmark for text effect transfer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3709–3723, 2020, doi: 10.1109/TPAMI.2020.2983697.
[16] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 694–711, doi: 10.1007/978-3-319-46475-6_43.
[17] K. Yan, S. Yang, W. Wang, and J. Liu, “Multitask attentive network for text effects quality assessment,” in Proc. IEEE Int. Conf. Multimedia Expo, 2020, pp. 1–6, doi: 10.1109/ICME46284.2020.9102871.
[18] X. Yang, T. Mei, Y.-Q. Xu, Y. Rui, and S. Li, “Automatic generation of visual-textual presentation layout,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 12, no. 2, pp. 1–22, 2016, doi: 10.1145/2818709.
 SP



Lei Cheng, Feng Yin, Sergios Theodoridis, Sotirios Chatzis, and Tsung-Hui Chang

Rethinking Bayesian Learning for Data Analysis
The art of prior and inference in sparsity-aware modeling

Sparse modeling for signal processing and ma-
chine learning, in general, has been at the fo-
cus of scientific research for over two decades.
Among others, supervised sparsity-aware
learning (SAL) consists of two major paths paved
by 1) discriminative methods that establish direct
input–output mapping based on a regularized cost
function optimization and 2) generative methods that
learn the underlying distributions. The latter, more
widely known as Bayesian methods, enable uncer-
tainty evaluation with respect to the performed pre-
dictions. Furthermore, they can better exploit related
prior information and also, in principle, can natural-
ly introduce robustness into the model, due to their
unique capacity to marginalize out uncertainties re-
lated to the parameter estimates. Moreover, hyper-
parameters (tuning parameters) associated with the
adopted priors, which correspond to cost function
regularizers, can be learned via the training data and
not via costly cross-validation techniques, which is,
in general, the case with the discriminative methods.
To implement SAL, the crucial point lies in the
choice of the function regularizer for discriminative
methods and the choice of the prior distribution for
Bayesian learning. Over the past decade or so, due
to the intense research on deep learning, emphasis
has been put on discriminative techniques. Howev-
er, a comeback of Bayesian methods is taking place
that sheds new light on the design of deep neural
networks (DNNs), which also establish firm links
with Bayesian models, such as Gaussian processes
(GPs), and also inspire new paths for unsupervised learning, such as Bayesian tensor decomposition. The goal of this article is two-fold. First, it aims to review, in a unified way, some recent advances in incorporating sparsity-promoting priors into three highly popular data modeling/analysis tools, namely, DNNs, GPs, and tensor decomposition. Second, it reviews their associated inference techniques from different aspects, including evidence maximization via optimization and variational inference (VI) methods. Challenges, such as the small
data dilemma, automatic model structure search, and natural prediction uncertainty evaluation, are also discussed. Typical signal processing and machine learning tasks are considered, such as time series prediction, adversarial learning, social group clustering, and image completion. Simulation results corroborate the effectiveness of the Bayesian path in addressing the aforementioned challenges and its outstanding capability of matching data patterns automatically.

Introduction
Over the past three decades or so, machine learning has been gradually established as the umbrella name to cover methods whose goal is to extract valuable information and knowledge from data and then use it to make predictions [1]. Machine learning has been extensively applied to a wide range of disciplines, such as signal processing, data mining, communications, finance, biomedicine, and robotics, to name but a few. The majority of the machine learning methods first rely on adopting a parametric model to describe the data at hand and then an inference/estimation technique to derive estimates that describe the unknown model parameters. In the discriminative methods, point estimates of the involved parameters are obtained via cost function optimization. In contrast, by practicing the Bayesian philosophy, one can infer the underlying statistical distributions that describe the unknown parameters, given the observed data, and thus provide a generative mechanism that models the random process that generates the data.

For the newcomers to machine learning, the discriminative (also referred to as cost function optimization) perspective might be more straightforward. It first formulates a task that quantifies the overall deviation between the observed target data and the model predictions and then solves it for the point parameter estimates via an optimization algorithm. On the contrary, the generative (Bayesian) perspective, which aims to reveal the generative process and the statistical properties of the observed data, sounds more complicated due to some “jargon” terms, such as prior, likelihood, posterior, and evidence. Nevertheless, machine learning under the Bayesian perspective is gaining in popularity recently due to the comparative advantages that spring from the nature of the statistical modeling and the extra information returned by the posterior distributions. This article aims at demystifying the philosophy that underlies the Bayesian techniques and then reviewing, in a unified way, recent advances of Bayesian SAL for three analysis tools of high current interest.

In the Bayesian framework, model sparsity is implemented via sparsity-promoting priors that lead to automatic model determination by optimally sparsifying an originally overparameterized model. The goal is to optimally predict the order of the system that corresponds to the best tradeoff between accuracy and complexity, with the aim to combat overfitting, in line with the general concept of regularization. However, in Bayesian learning, all the associated (hyper)parameters, which control the degree of regularization, can be optimally obtained via the training set during the learning phase. It is hoped that this article can help newcomers grasp the essence of Bayesian learning and, at the same time, provide experts with an update of some recent advances developed for different data modeling and analysis tasks. In particular, we focus on Bayesian SAL for three popular data modeling and analysis tools, namely, DNNs, GPs, and tensor decomposition, that have promoted intelligent signal processing applications. Some typical examples are as follows.

On the supervised learning front with overparameterized DNNs, novel data-driven mechanisms have been proposed in [2], [3], [4], [5], and [6] to intelligently prune redundant
neuron connections without human assistance. In a similar vein, in [7], [8], and [9], sparsity-promoting priors have been used in the context of the GPs that give rise to optimal and interpretable kernels that are capable of identifying a sparse subset of effective frequency components automatically. On the unsupervised learning front, some advanced works on tensor decomposition, e.g., [10], [11], [12], [13], [14], and [15], have shown that sparsity-promoting priors are able to unravel the few underlying interpretable components in a completely tuning-free fashion. Such techniques have found various signal processing applications, including data classification [2], [5], [6], adversarial learning [3], [4], time series prediction [7], [8], [9], [16], blind source separation [10], [13], [17], image completion [12], [14], [15], and wireless communications [18].

The aforementioned references address the following two state-of-the-art challenges:
1) The art of prior: How should the fundamental sparsity-promoting priors be chosen and tailored to fit modern data-driven models with complex structures?
2) The art of inference: How can recent optimization theory and stochastic approximation techniques be leveraged to design fast, accurate, and scalable inference algorithms?
This tutorial-style article aims to give a unified treatment of the underlying common ideas and techniques to offer concrete answers to the preceding questions. It is the goal of this article to provide a comprehensive review of such sparsity-promoting techniques. On the one hand, we introduce some newly proposed sparsity-promoting priors as well as various salient ones that, although being powerful, had never been used before in our target models. On the other hand, we showcase some recent developments of the associated inference algorithms. For readers with different backgrounds and familiarity with Bayesian statistics, we provide a road map in Figure 1 to facilitate their reading.

The remaining sections of this article are organized as follows. In the “Bayesian Learning Basics” section, we introduce some Bayesian learning basics, aiming to let the readers easily follow the main concepts, jargon terms, and math notations. In the “Sparsity-Aware Learning: Regularization Functions and Prior Distributions” section, we first review two different paths (namely, the regularized optimization and Bayesian paths) and further introduce some sparsity-promoting priors along the Bayesian path. In the “The Art of Prior: Sparsity-Aware Modeling for Three Case Studies” section, we demonstrate how to integrate the introduced sparsity-promoting priors into three prevailing data analysis tools, i.e., the DNNs, GPs, and tensor decomposition. For the reviewed sparsity-aware learning models, we further introduce their associated inference methods in the “The Art of Inference: Evidence Maximization and Variational Approximation” section. Various signal processing applications of high current interest enabled by the aforementioned models are exemplified in the “Applications in Signal Processing and Machine Learning” section. Finally, we conclude the article and bring up some potential future research directions in the “Concluding Remarks” section.

FIGURE 1. The organization of this article and a road map for the readers.



Bayesian learning basics
We first provide some touches on the philosophy of Bayesian learning in the “Bayesian Philosophy Basics” section and use Bayesian linear regression as an example to elucidate different symbol notations, terminology, and unique features of Bayesian learning in the “Bayesian Linear Parametric Regression: A Pedagogic Example” section. Then, we discuss extensions to the nonlinear and nonparametric cases, shedding light on the connections between simple linear regression and advanced GP regression, in the “Bayesian Nonlinear Nonparametric Model: GP Regression Example” section.

Bayesian philosophy basics

Bayes’ theorem
Let D be the observed (training) dataset and M be the underlying model that is assumed to generate the data. For simplicity, we start our treatment with models that are parameterized in terms of a set of unknown parameters θ ∈ R^{L×1}, where R is the set of real numbers. By the definition of parametric models, the dimension L is preselected and fixed. According to the Bayesian philosophy, these parameters are treated as random variables. Their randomness does not imply a random nature of these parameters but essentially encodes our uncertainty with respect to their true (yet unknown) values; see related discussions in, e.g., [1]. First, in Bayesian modeling, we assume that the set of unknown random parameters is described by a prior distribution, i.e., θ ~ p_M(θ; η_p), which encodes our prior belief in θ; that is, it encodes our uncertainty prior to receiving the dataset D. As we see soon, this corresponds to regularizing the learning task since it will bias the solution that we seek toward certain regions in the parameter space. The prior p_M(θ; η_p) is specified via a set of deterministic yet unknown hyperparameters (tuning parameters) stacked together in a vector and denoted by η_p.

The second quantity that is assumed to be known is the conditional distribution that describes the data, given the values of the parameters, θ, which, for the specific observed dataset D, is known as the likelihood p_M(D | θ). (Throughout this article, we use “;”, i.e., p(x; η), if η is a deterministic parameter to be optimized or preselected by the user, and we use “|” in p(x | η) if η is a random variable or a hyperparameter treated as a random variable, that is, if the distribution is conditional on another random variable.)

Having selected the likelihood and the prior distribution function, the goal of Bayesian inference is to infer (estimate) the posterior distribution of the parameters, given the observations, i.e., p_M(θ | D; η), that constitutes the update of the prior assumption encoded in p_M(θ; η_p) after digesting the dataset D. This process can be elegantly described by the celebrated Bayes’ theorem:

p_M(θ | D; η) = p_M(D | θ) p_M(θ; η_p) / p_M(D; η).  (1)

Note that η includes both the hyperparameters associated with the prior, η_p, and some extra hyperparameters involved in the likelihood function p_M(D | θ), which are omitted for notation brevity.

Bayes’ theorem solves for the inverse problem that is associated with any machine learning task. The forward problem is an easy one. Given the model M and the values of the associated parameters θ, one can easily generate the output observations D from the conditional distribution p_M(D | θ). The task of machine learning is the opposite and a more difficult one. Given the observed data D, the task is to estimate/infer the model M. This is known as the inverse problem, and Bayes’ theorem applied to the machine learning task does exactly that. It relates the inverse problem (posterior) to the forward one (likelihood). All one needs for this “update” is to assume a prior and also to obtain an estimate of the distribution associated with the data, which constitutes the denominator in (1). The latter term and the related information are neglected in the discriminative models; hence, important information is not taken into account (see the discussion in, e.g., [1] and [19]).

Occasionally, we may need a point estimate of the model parameters as the intermediate result, and there are two commonly used estimates that can be computed from the posterior distribution, p_M(θ | D; η). Assuming that η is known or an estimate is available, the first one is known as the maximum a posteriori (MAP) estimate, and the other one as the minimum mean-square error (MSE) estimate; concretely,

θ̂_MAP = argmax_θ p_M(θ | D; η)  (2)

θ̂_MMSE = ∫ θ · p_M(θ | D; η) dθ.  (3)
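To make these quantities concrete, the minimal sketch below discretizes a scalar toy problem on a grid and evaluates the posterior, the MAP estimate (2), and the MMSE estimate (3) numerically; the Gaussian prior/likelihood choices and all numbers are illustrative assumptions only.

import numpy as np

# Toy setup (assumed for illustration): scalar parameter theta with prior
# p(theta) = N(0, 1) and likelihood y_n ~ N(theta, 0.5^2) for each sample.
theta = np.linspace(-4.0, 4.0, 4001)            # grid over the parameter
prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)
y = np.array([0.9, 1.1, 1.4])                   # observed dataset D
lik = np.ones_like(theta)
for yn in y:                                    # p(D | theta) = prod_n p(y_n | theta)
    lik *= np.exp(-0.5 * ((yn - theta) / 0.5) ** 2)

post = prior * lik                              # numerator of Bayes' theorem (1)
post /= np.trapz(post, theta)                   # divide by the evidence p(D)

theta_map = theta[np.argmax(post)]              # MAP estimate, cf. (2)
theta_mmse = np.trapz(theta * post, theta)      # MMSE estimate (posterior mean), cf. (3)
print(theta_map, theta_mmse)

Because the resulting posterior here is a unimodal Gaussian, the two estimates coincide, a point that is revisited below for the Bayesian linear regression model.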
Evidence maximization for hyperparameter learning
In the prior distribution, p_M(θ; η_p), the hyperparameters, η_p, could be either preselected according to the side information at hand or learned from the observed dataset D. In Bayesian learning, the latter path is followed favorably. One popular alternative is to select the full set of hyperparameters η to be the most compatible with the observed dataset D, which can be naturally formulated as the following so-called evidence maximization:

max_η log p_M(D; η)  (4)

where

p_M(D; η) = ∫ p_M(D | θ) p_M(θ; η_p) dθ  (5)

is known as the model evidence since it measures the plausibility of the dataset D, given the hyperparameters η. Note that the evidence depends on the model itself and not on any specific value of the parameters θ, which have been integrated out (marginalized). This is a crucial difference compared with the discriminative methods. As can be shown, the evidence maximization problem (4) involves a tradeoff between accuracy (the achieved likelihood value) and model complexity, in line with Occam’s razor [1], [20]. This allows the computation of the model hyperparameters η directly from the observed dataset D. At this point, recall that one of the major difficulties associated with machine learning, and the inverse problems, in general, is overfitting. That is, if the model is too complex with respect to
the number of training data samples, then the estimated models learn the specificities of the given training data but cannot generalize well when dealing with new unseen (test) data.

The use of regularization in the discriminative methods and priors in the Bayesian ones try to achieve the best tradeoff between accuracy (fitting to the observed data) and generalization that heavily depends on the complexity of the model; see, e.g., [1] and [19] for further discussions. Furthermore, note that in the Bayesian context, model complexity is interpreted from a broader view since it depends not only on the number of parameters but also on the shape (e.g., variance and skewness) of the involved distributions of θ; see, e.g., [1] and [19] for in-depth discussions. For example, under a broad enough Gaussian prior for the model parameters, θ, and some limiting properties, it can be shown that the evidence in (5) results in the well-known Bayesian information criterion for model selection [1], [21], which has the form

log p_M(D; η) = log p_M(D | θ̂_MAP) − (L/2) log N  (6)

where the first term on the right-hand side is the accuracy (the best likelihood fit) term and the second is the complexity term that “competes” in a tradeoff fashion while maximizing the evidence; see, e.g., [1] and [22] for further discussion. In (6), L denotes the number of unknown parameters in θ, and N is the size of the training data. A more recent interpretation of this tradeoff, in the context of overparameterized DNNs, is provided in [23], where the prior is viewed as the inductive bias that favors certain datasets.

Marginalization for prediction
The learned posterior p_M(θ | D; η) provides uncertainty information about θ, i.e., the plausibility of each possible θ to be endorsed by the observed dataset D, and it can be used to forecast an unseen dataset, D_new, via marginalization:

p_M(D_new | D; η) = ∫ p_M(D_new | θ) p_M(θ | D; η) dθ.  (7)

From (7), Bayesian prediction can be interpreted as the weighted average of the predicted probability p_M(D_new | θ) among all possible model configurations, each of which is specified by different model parameters, θ, and weighted by the respective posterior p_M(θ | D; η) [in this article, the unseen dataset D_new is assumed to be statistically independent of the training dataset D; therefore, p_M(D_new | θ) = p_M(D_new | D, θ)]. In other words, prediction does not depend on a specific point estimate of the unknown parameters, which equips Bayesian methods with great potential for more robust predictions against the estimation error of θ; see, e.g., [1] and [19].
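The integral in (7) rarely needs a closed form to be useful; a simple Monte Carlo approximation already illustrates the idea. The sketch below contrasts a plug-in prediction with a marginalized one for an assumed scalar Gaussian posterior; all numbers are invented for illustration only.

import numpy as np

rng = np.random.default_rng(1)

# Assumed ingredients (purely illustrative): a scalar Gaussian posterior
# over theta and a Gaussian likelihood y_new ~ N(theta * x_new, noise_var).
post_mean, post_var = 1.0, 0.3
noise_var, x_new = 0.1, 2.0

# Marginalized prediction, cf. (7): draw theta from the posterior and
# average the resulting predictive distributions.
thetas = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
y_new = rng.normal(thetas * x_new, np.sqrt(noise_var))
print(y_new.mean(), y_new.var())   # approx. 2.0 and 0.1 + 2.0**2 * 0.3 = 1.3

# Plug-in prediction: condition on the single point estimate theta = post_mean;
# its variance (0.1) ignores the parameter uncertainty entirely.
print(post_mean * x_new, noise_var)

The inflated predictive variance of the marginalized prediction is precisely the robustness against parameter estimation errors referred to above.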
In summary, in light of Bayes’ theorem, the four quantities [i.e., prior p_M(θ; η_p), likelihood p_M(D | θ), posterior p_M(θ | D; η), and evidence p_M(D; η)] give a new perspective on the inverse problem. The resulting method combines the strength of the selected priors and the likelihood of the observed data to provide a corresponding posterior. The success of such an inference process strongly relies on the following three steps.

First, incorporating a prior for each unknown model parameter/function enables one to naturally encode a desired structure into Bayesian learning. As demonstrated in the rest of this article, a prior can be imposed on both parametric models with a fixed number of unknown parameters and nonparametric models that comprise unknown functions and/or an unknown set of parameters whose number is not fixed but varies with the size of the dataset. Second, through evidence maximization, one can optimize the set of hyperparameters that is associated with the selected Bayesian learning model to obtain enhanced generalization performance. Finally, marginalization ensures robust prediction and generalization performance by averaging over an ensemble of predictions, using all possible parameter/function estimates weighted by the corresponding posterior probability. These three aspects are discussed in detail in the following sections.

Bayesian linear parametric regression: A pedagogic example
Before moving to our next topics on more advanced Bayesian data analysis, we introduce the Bayesian linear regression model as an example to further elaborate the terminology and concepts discussed previously. It also serves as the cornerstone for the two recent supervised learning tools, namely, the Bayesian NNs and GP models elaborated in the following sections.

Linear regression
In statistics, the term regression refers to seeking the relationship between a dependent random variable, y, which is usually considered the response of a system, and the associated input/independent variables, x = [x_1, x_2, …, x_L]^T. When the system is modeled as a linear combiner with an additive disturbance or noise term v_n, the relationship between y_n and x_n of the nth data sample can be expressed as

y_n = θ^T x_n + v_n,  ∀n ∈ {1, 2, …, N}  (8)

which specifies the linear regression task. For simplicity, we assume that the additive noise terms {v_n} are independent identically distributed (i.i.d.) Gaussian with zero mean and variance β^{-1}, i.e., v_n ~ N(v_n; 0, β^{-1}), where β (i.e., the inverse of the variance) is called precision in statistics and machine learning. The task of linear regression is to learn the weight parameters θ = [θ_1, θ_2, …, θ_L]^T from the training/observed dataset D ≜ {X, y}, where the input matrix X ≜ [x_1, x_2, …, x_N]^T ∈ R^{N×L} and the output vector y ≜ [y_1, y_2, …, y_N]^T.

Bayesian learning
For the linear regression task, we take a Bayesian perspective by treating the unknown parameters θ as a random vector. As introduced in the “Bayesian Philosophy Basics” section, the inverse problem can be solved via Bayes’ theorem after specifying the following four quantities:
■ Likelihood: The easiest one to derive is the likelihood function, which describes the forward problem of linear regression.
Owing to the Gaussian and independence properties of the noise terms {v_n}, the following Gaussian likelihood function can be easily obtained:

p_M(D | θ) = ∏_{n=1}^{N} N(y_n; θ^T x_n, β^{-1}).  (9)

■ Prior: Then, we specify a prior on the unknown parameters θ. For mathematical tractability, we adopt an i.i.d. Gaussian distribution as the prior:

p_M(θ; η_p) = ∏_{l=1}^{L} N(θ_l; 0, α_l^{-1})  (10)

where α_l is the precision associated with each θ_l and η_p = α ≜ [α_1, α_2, …, α_L]^T represents the hyperparameters associated with the prior.
■ Evidence: After substituting the prior (10) and the likelihood (9) into (5) and performing the integration, we can derive the following Gaussian evidence:

p_M(D; η) = N(y; 0, β^{-1} I + X A^{-1} X^T)  (11)

where the diagonal matrix A ≜ diag{α} and I denotes the identity matrix. Here, we have η = [η_p^T, β]^T.
■ Posterior: Inserting the prior (10), the likelihood (9), and the evidence (5) into Bayes’ theorem (1), the posterior can be shown to be the Gaussian distribution

p_M(θ | D; η) = N(θ; μ, Σ)  (12)

where

μ = β Σ X^T y  (13a)
Σ = (A + β X^T X)^{-1}.  (13b)

Once again, taking the preceding linear regression as a concrete example, we further demonstrate the merits of Bayesian learning in general, as follows:
■ Merit 1—Parameter learning with uncertainty quantification: Using Bayes’ theorem, the posterior in (12) not only provides a point estimate μ in (13a) for the unknown parameters θ but also provides a covariance matrix Σ in (13b) that describes to which extent the posterior distribution is centered around the point estimate μ. In other words, it quantifies our uncertainty about the parameter estimate, which cannot be naturally obtained in any discriminative method. For the preceding example, we have θ̂_MAP = θ̂_MMSE because the posterior distribution p_M(θ | D; η) follows a unimodal Gaussian distribution. Of course, frequentist methods can also construct uncertainty regions/confidence intervals by taking a few extra steps once the parameter estimates have been obtained. However, the Bayesian method provides, in one go, the posterior distribution of the model parameters, from which both a point estimate as well as the uncertainty region can be optimally derived via the learning optimization step.
■ Merit 2—Robust prediction via marginalization: After substituting (12) and (9) tailored to new observations into (7), the posterior/predictive distribution for a novel input x* is

p_M(y* | {X, y}) = ∫ N(y*; θ^T x*, β^{-1}) N(θ; μ, Σ) dθ = N(y*; μ^T x*, β^{-1} + x*^T Σ x*).  (14)

The predicted value of y* can be acquired via μ^T x*, and the posterior variance, β^{-1} + x*^T Σ x*, quantifies the uncertainty about this point prediction. Rather than providing a point prediction as in the discriminative methods, Bayesian methods advocate averaging all possible predicted values via marginalization and are thus more robust against erroneous parameter estimates, as the sketch below illustrates.
h p = a _ 6a 1, a 2, f, a L@T represents the hyperparameters
associated with the prior. Bayesian nonlinear nonparametric model:
■■ Evidence: After substituting the prior (10) and the likelihood GP regression example
(9) into (5) and performing the integration, we can derive To improve the data representation power of Bayesian linear
the following Gaussian evidence: parametric models, a lot of effort has been invested in designing
nonlinear and nonparametric models. A direct nonlinear gener-
p M (D; h) = N ^ y; 0, b -1 I + X A -1 X T h (11) alization of (8) is given by

where the diagonal matrix A _ diag " a , and I denotes the y = f (x) + v (15)
identity matrix. Here, we have h = 6h Tp , b@ .
T

■■ Posterior: Inserting the prior (10), the likelihood (9), and the where, instead of the linearity of (8), we employ a nonlinear
evidence (5) into Bayes’ theorem (1), the posterior can be functional dependence f (x) and let v be the noise term, as be-
shown to be the Gaussian distribution fore. Moreover, the randomness associated with the weight pa-
rameters i in (8) is now embedded into the function f (x) itself,
p M (i ; D; h) = N (i; n, R) (12)
which is assumed to be a random process. That is, the outcome/
where realization of each random experiment is a function instead of a
T single value/vector. Thus, in this case, we have to deal with pri-
n = bRX y (13a)
ors related to nonlinear functions directly rather than indirectly,
R = (A + bX T X )-1.(13b) i.e., by specifying a family of nonlinear parametric functions and
placing priors over the associated weight parameters.
Once again, taking the preceding linear regression as a
concrete example, we further demonstrate the merits of GP model
Bayesian learning in general, as follows: In the following, we introduce one representative model that
■■ Merit 1—Parameter learning with uncertainty quantifica- adopts this rationale, namely, the GP model for nonlinear regres-
tion: Using Bayes’ theorem, the posterior in (12) not only sion. The GP models constitute a special family of random pro-
provides a point estimate n in (13a) for the unknown param- cesses, where the outcome of each experiment is a function or a
eters i but also provides a covariance matrix R in (13b) that sequence. For instance, in signal processing, this can be a con-
describes to which extent the posterior distribution is cen- tinuous-time signal f(t) as a function of time t or a discrete-time
tered around the point estimate n. In other words, it quanti- signal f(n) in terms of the sequence index n. In this article, we
fies our uncertainty about the parameter estimate, which treat the GP model as a data analysis tool whose input that acts as
cannot be naturally obtained in any discriminative method. the argument in f ($) is a vector; i.e., vector x = 6x 1, x 2, f, x L@T .
For the preceding example, we have it MAP = it MMSE because For clarity, we give the definition of GP as follows [1], [24]: a
the posterior distribution p M (i ; D; h) follows a unimodal random process, f (x), is called a GP if and only if for any finite
Gaussian distribution. Of course, frequentist methods can number of points, x 1, x 2, f, x N , the associated joint probability
also construct uncertainty region/confidence intervals by tak- density function (pdf), p ^ f (x 1), f (x 2), f, f (x N )h, is Gaussian.
ing a few extra steps once the parameter estimates have been A GP can be considered an infinitely long vector of joint-
obtained. However, the Bayesian method provides, in one ly Gaussian distributed random variables, so it can be fully
go, the posterior distribution of the model parameters, from described by its mean function and covariance function, defined
which both a point estimate as well as the uncertainty region as follows:
can be optimally derived via the learning optimization step.
■■ Merit 2—Robust prediction via marginalization: After m (x) _ E 6 f (x)@ (16)
substituting (12) and (9) tailored to new observations into cov (x, xl ) _ E 6^ f (x) - m (x)h^ f (xl ) - m (xl ) h@ .(17)

A GP is said to be stationary if the mean function, m(x), is a constant, and, moreover, its covariance function has the following simplified form: cov(x, x′) = cov(τ), with τ ≜ x − x′.

When a GP is adopted for data modeling and analysis, we need to specify the mean function and the covariance function to make the model match the underlying data patterns. The mean function is often set to zero, especially when there is no prior knowledge available. The data representation power of the nonparametric GP models is determined overwhelmingly by the covariance function, which is also known as the kernel function, due to the positive semidefinite nature of a covariance function. In the following, we use k(x, x′; η_p) ≜ cov(x, x′) to represent a preselected kernel function with an explicit set of tuning kernel hyperparameters, η_p, for the observed data. Finally, we say that a function realization is drawn from the GP prior, and we write

f(x) ~ GP(m(x), k(x, x′; η_p)).  (18)

The consequent GP regression model follows (15), where f(x) is represented by a GP model defined in (18) and the noise term v is assumed to be Gaussian distributed with zero mean and variance β^{-1}, as in the previous simple Bayesian linear regression example.

FIGURE 2. Sample functions generated from a GP model by using the SE kernel with two different hyperparameter configurations. (a) A GP with hyperparameters σ_s² = 1, ℓ = 5 generates low-peaked and smooth sample functions. (b) A GP with hyperparameters σ_s² = 5, ℓ = 0.5 generates high-peaked and rapidly varying sample functions.

GP kernel functions
As mentioned before, the kernel function plays a crucial role in determining a GP model’s representation power. To shed more light on the kernel function, especially on how it represents random functions as well as its good physical interpretations, we demonstrate the most widely used (but not necessarily optimal) squared-exponential (SE) kernel.

SE kernel
The form of this widely used kernel function is

k(x, x′; η_p) = σ_s² exp(−‖x − x′‖₂² / (2ℓ²))  (19)

where the hyperparameter σ_s² determines the magnitude of fluctuation of f(x) and the other hyperparameter ℓ, called the length scale, determines the statistical correlation between two points, f(x) and f(x′), separated by a (Euclidean) distance d ≜ ‖x − x′‖₂. Thus, we have the kernel hyperparameters, η_p = [σ_s², ℓ]^T, specifically for this kernel.

In Figure 2, we show some sample functions generated from a GP (for 1D input x) involving the SE kernel with different hyperparameter configurations. From these illustrations, we can clearly spot the physical meaning of the SE kernel hyperparameters. There are many other classic kernel functions, such as the Ornstein–Uhlenbeck kernel, rational quadratic kernel, periodic kernel, and locally periodic kernel, as introduced in, e.g., [24]. They can even be combined, for instance, in the form of a linearly weighted sum, to enrich the overall modeling capacity [24], [25].
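The sample paths in Figure 2 are straightforward to reproduce; a minimal sketch is given below, where the grid, the jitter term for numerical stability, and the two hyperparameter pairs are illustrative assumptions.

import numpy as np

def se_kernel_matrix(x, sigma2, ell):
    # Kernel matrix of the SE kernel (19) evaluated on a 1D input grid x.
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma2 * np.exp(-d2 / (2.0 * ell ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 400)

for sigma2, ell in [(1.0, 5.0), (5.0, 0.5)]:      # the two Figure 2 settings
    K = se_kernel_matrix(x, sigma2, ell)
    K += 1e-6 * np.eye(x.size)                    # jitter for numerical stability
    # One draw from the zero-mean GP prior (18): f ~ N(0, K).
    f = np.linalg.cholesky(K) @ rng.normal(size=x.size)
    print(sigma2, ell, float(f.min()), float(f.max()))

Plotting f against x reproduces the qualitative behavior of Figure 2: the (1, 5) setting yields smooth, slowly varying draws, while (5, 0.5) yields rapidly fluctuating ones.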
Designing a competent stationary kernel function for the GP model can also be considered in the frequency domain, owing to the famous Wiener–Khintchine theorem [24]. The theorem states that the Fourier transform of a stationary kernel function, k(τ), and the associated spectral density of the process, S(s), are Fourier duals:

k(τ) = ∫ S(s) e^{2πi s^T τ} ds,  S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.  (20)

Here, it is noteworthy to mention that i is the imaginary unit, and the operation s^T τ refers to the inner product of the generalized frequency parameters, s, and the time difference parameters, τ. In the “The Art of Prior: Sparsity-Aware Modeling for Three Case Studies” section, we introduce some optimal kernel design methods that were first built based on the spectral density in the frequency domain and then transformed back to the original input domain.
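As a worked example of (20), the SE kernel (19) admits a closed-form spectral density; for a scalar input (a standard result, see, e.g., [24]),

$$k(\tau) = \sigma_s^2 \exp\!\left(-\frac{\tau^2}{2\ell^2}\right) \;\longleftrightarrow\; S(s) = \sigma_s^2 \sqrt{2\pi\ell^2}\, \exp\!\left(-2\pi^2 \ell^2 s^2\right),$$

so a long length scale ℓ concentrates the power at low frequencies (smooth sample paths), while a short ℓ spreads it over a wide band, consistent with the two regimes in Figure 2.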
GP for regression
In contrast to the Bayesian linear regression model, we set a GP prior directly on the underlying function in the GP regression model, namely, f(x) ~ GP(m(x), k(x, x′; η_p)). Given the observed dataset, D = {X, y}, as defined before, the main goal of GP-based Bayesian nonlinear regression is to compute the evidence, p(y; η), for optimizing the model hyperparameters, η, and to compute the posterior distribution, p(y* | y),
of y* = [y*_1, y*_2, …, y*_{N*}]^T evaluated at N* novel test inputs X* = [x*_1, x*_2, …, x*_{N*}]^T.

Evidence
This can be obtained in a straightforward way due to the regression model y = f(X) + v, where v is independent of the GP model, f(x) ~ GP(m(x), k(x, x′; η_p)), and we let v ~ N(0, β^{-1} I). As a consequence, it is easy to derive (see, e.g., [1], [19], and [24]) that

p(y; η) = N(y; 0, K(X, X; η_p) + β^{-1} I)  (21)

where η = [η_p^T, β]^T and K(X, X; η_p) is the N × N kernel matrix of f(X) ≜ [f(x_1), f(x_2), …, f(x_N)]^T evaluated for the training samples. Note that the kernel matrix is a square matrix whose (i,j)th entry is the pairwise covariance between f(x_i) and f(x_j), computed according to (17), for any x_i and x_j in the training dataset. The covariance matrix, X A^{-1} X^T, of the Bayesian linear regression function, f(x) = θ^T x, given in (11) can be regarded as one instance of the kernel matrix, K(X, X; η_p). The latter can provide increased representation power through choosing more appropriate kernel forms and tuning the associated kernel hyperparameters. As shown in the “The Art of Inference: Evidence Maximization and Variational Approximation” section, we maximize this evidence function to get an optimal set of model hyperparameters.

Posterior distribution
It turns out (see, e.g., [1], [19], and [24]) that the joint distribution of the training output y and the test output y* is a Gaussian of the following form:

[y; y*] ~ N([y; y*]; 0, [K(X, X) + β^{-1} I, K(X, X*); K(X*, X), K(X*, X*) + β^{-1} I])  (22)

where the covariance is written in block form, K(X, X*) stands for the N × N* kernel matrix between the training inputs and test inputs, and K(X*, X*) for the N* × N* kernel matrix among the test inputs. Here, we let K(X, X) be a short form of K(X, X; η).

By applying some classic conditional Gaussian results (see, e.g., [1], [19], and [24]), we can derive the posterior distribution from the joint distribution in (22) as

p(y* | y) = N(y*; m̄, V̄)  (23)

where the posterior mean (vector) and the posterior covariance (matrix) are, respectively,

m̄ = K(X*, X)[K(X, X) + β^{-1} I]^{-1} y  (24)
V̄ = K(X*, X*) + β^{-1} I − K(X*, X)[K(X, X) + β^{-1} I]^{-1} K(X, X*).  (25)

The preceding posterior mean gives a point prediction, while the posterior covariance defines the uncertainty region of such prediction. A leading benefit of using the GP models over discriminative methods, such as kernel ridge regression, lies in the natural uncertainty quantification given by (25).

A graphical illustration of GP working on a toy regression example is provided in Figure 3. As we can see from the figures, the uncertainty in the GP prior is constantly large, reflecting our crude prior belief in the underlying function. While it has been significantly reduced in the neighborhood of the observed data points in the GP posterior, it still remains comparably large in regions where the observed data points are scarce. Apart from the representation power of the GP model, it also connects to various other salient machine learning models, including, for instance, the relevance vector machine (RVM) and support VM [24]. Also, it has been shown that an NN, with one or multiple hidden layers, asymptotically approaches a GP; see, e.g., [24] and [26].

FIGURE 3. (a) Three sample functions drawn randomly from a GP prior [refer to (18)] with an SE kernel [refer to (19)]. (b) Three sample functions drawn from the GP posterior [refer to (23), (24), and (25)] computed based on the prior in (a) as well as four noisy observations, indicated by the black circles. The corresponding posterior mean function is depicted by the dark black curve. The gray shaded area represents the uncertainty region, taken as the 95% confidence region (CR) for both the prior and the posterior herein.
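A compact NumPy sketch of the posterior equations (24) and (25) is given below; the SE kernel of (19), the four noisy observations, and the hyperparameter values mirror the Figure 3 setting only loosely and are assumptions for illustration.

import numpy as np

def se_kernel(a, b, sigma2=1.0, ell=1.0):
    # SE kernel (19) between two 1D input sets of shapes (N,) and (M,).
    return sigma2 * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * ell ** 2))

beta = 100.0                                      # noise precision
X = np.array([-6.0, -2.0, 1.0, 5.0])              # four observed inputs
y = np.sin(X) + np.random.default_rng(2).normal(0.0, 0.1, size=4)
Xs = np.linspace(-10.0, 10.0, 200)                # test inputs X*

K = se_kernel(X, X) + np.eye(4) / beta            # K(X, X) + beta^{-1} I
Ks = se_kernel(Xs, X)                             # K(X*, X)
Kss = se_kernel(Xs, Xs) + np.eye(200) / beta      # K(X*, X*) + beta^{-1} I

m_bar = Ks @ np.linalg.solve(K, y)                # posterior mean, cf. (24)
V_bar = Kss - Ks @ np.linalg.solve(K, Ks.T)       # posterior covariance, cf. (25)
band = 1.96 * np.sqrt(np.diag(V_bar))             # 95% CR half-width around m_bar

Plotting m_bar together with the m_bar ± band envelope against Xs reproduces the qualitative picture of Figure 3(b): the band pinches near the four observations and widens where data are scarce.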
SAL: Regularization functions and prior distributions
In modern big data analysis, there is a trend to employ sophisticated models that involve an excessive number of parameters
(sometimes even more than the number of data samples, e.g., in overparameterized models). This makes the learning systems vulnerable to overfitting to the observed data. Thus, the obvious question concerns the right model size given the data sample. SAL that promotes sparsity on the structure of the learned model constitutes a major path in dealing with such models in a data-adaptive fashion. The term sparsity implies that most of the unknown parameters are pushed to (almost) zero values. This can be achieved either via the combination of a discriminative method and appropriate regularizers or via the Bayesian path by adopting sparsity-promoting priors. The major difference between the two paths lies in the way “sparsity” is interpreted and embedded into the models, as explained in the following sections. In the following, we introduce the first path that leads to SAL via regularized optimization methods in the “SAL via Regularized Optimization Methods” section, followed by the SAL via Bayesian methods for parametric models in the “SAL via Bayesian Methods for Parametric Models” section and nonparametric models in the “SAL via Bayesian Methods for Nonparametric Models” section. Note that the aim of this article is not to compare the two paths but, rather, to rethink the Bayesian philosophy.

SAL via regularized optimization methods
Following the regularized optimization way, “sparsity” information is embedded through regularization functions. Using the linear regression task as an example, the regularized parameter optimization problem is formulated as

min_θ (1/2) Σ_{n=1}^{N} (y_n − θ^T x_n)² + λ · r(θ)  (26)

where the first term is the data fitting cost, the regularization function r(θ) steers the solution toward a preferred sparse structure, and the regularization parameter λ is to balance the tradeoff between the data fitting cost and the regularization function for sparse structure embedding. In SAL, it is assumed that the unknown parameters θ have a majority of zero entries, and thus the adopted regularization function r(θ) should help the optimization process unveil such zeros. Such regularization functions include the family of ℓ_p-norm functions, with 0 ≤ p ≤ 1, among which the ℓ_1 norm is most popular since it retains the computationally attractive property of convexity. Furthermore, strong theoretical results have been derived; see, e.g., [1] and [27]. In recent years, SAL advances via regularized cost optimization prevailed in the context of machine learning using data analysis tools. The literature is very rich and fairly well documented with many sparsity-promoting regularization functions. Although the resulting regularized cost function might be nonconvex and/or nonsmooth, efficient learning algorithms exist and have been built on solid theoretical foundations in optimization theory; see, e.g., [28].
els in the “SAL via Bayesian Methods for Parametric Models”
section and nonparametric models in the “SAL via Bayesian SAL via Bayesian methods for parametric models
Methods for Nonparametric Models” section. Note that the Before we enter into a more formal presentation of a family
aim of this article is not to compare the two paths but, rather, of pdfs that promote sparsity, let us first view sparsity from a
to rethink the Bayesian philosophy. slightly different angle. It is well known that there is a bridge
between the estimate obtained from solving problem (26) and
SAL via regularized optimization methods the MAP estimate (see the “Bayesian Philosophy Basics” sec-
Following the regularized optimization way, “sparsity” informa- tion). For example, it is not difficult to see that if r (i) is the
tion is embedded through regularization functions. Using the squared Euclidean norm (i.e., the , 2 norm that gives rise to the
linear regression task as an example, the regularized parameter so-called ridge regression), the resulting estimate corresponds
optimization problem is formulated as to the MAP one when assuming the noise to be i.i.d. Gaussian
and the prior on i to be also of Gaussian form. If, on the other
min 1 / ^ y n - i T x n h +
N
2 hand, r (i) takes the , 1 norm, this corresponds to imposing a La-
5 9
m # r (i)
i 2 n =1 placian prior, instead of a Gaussian one, on i. For comparison,
14444244443 regularization parameter regularization function
data fitting cost Figure 4 presents the Laplacian and Gaussian priors for i ! R 2.
(26)
It is readily seen that the Laplacian distribution is heavy tailed
where the regularization function r (i) steers the solution toward compared to the Gaussian one. In other words, the probability
a preferred sparse structure and the regularization p­ arameter m is that the parameters will take nonzero values for the zero-mean

0.25 1
0.2 0.8
0.15 0.6
p (θ1, θ2)

p (θ1, θ2)

0.1 0.4
0.05 0.2
0 0
5 5
5 5
0 0
θ1 0 θ1 0
θ2 θ2
–5 –5 –5 –5
(a) (b)

FIGURE 4. The joint probability distribution of the model parameters in 2D space. (a) The Laplacian distribution. (b) The Gaussian distribution. The heavy-
tail Laplacian distribution peaks sharply around zero and falls slowly along the axes, thus promoting sparse solutions in a probabilistic manner. On the
contrary, the Gaussian distribution decays more rapidly along both dimensions when compared to the Laplacian distribution.

26 IEEE SIGNAL PROCESSING MAGAZINE | November 2022 |


Gaussian goes to zero very fast. Most of the probability mass tive variances, g l, l = 1, 2, f, L, are also random variables,
concentrates around zero. This is bad news for sparsity since we each one following a prior, p ^g l; h p h, where h p is a set of tun-
want most of the values to be (close to) zero while some of the ing hyperparameters associated with the prior. Thus, the GSM
parameters still have large values. In contrast, observe that in prior for each i l is expressed as
the Laplacian, although most of the probability mass is close to
zero, there is still a high enough probability of nonzero values. p ^ i l; h p h = # N ^i l; 0, g lhp^g l; h phdg l .(27)
More importantly, this probability mass is concentrated along
the axes, where one of the parameters is zero. This is how the By varying the functional forms of p ^g l; h p h, the marginal-
Laplacian prior promotes sparsity. Thus, to practice “Bayesian- ization (i.e., integrating out the dependence on g l) performed in
ism,” one explicit path is to construct priors with heavy tails to light of (27) induces different prior distributions of i. For exam-
promote sparsity. In the following, we introduce an important ple, if p ^g l; h p h is an inverse Gamma distribution, (27) induces
family of such sparsity-promoting priors. a Student’s t distribution [29]; if p ^g l; h p h is a gamma distri-
bution, (27) induces a Laplacian distribution [29]. For clarity,
The Gaussian scale mixture prior Table 1 summarizes different heavy-tail distributions, including
The kickoff point of the Gaussian scale mixture (GSM) prior (see, Normal–Jeffreys, generalized hyperbolic, and horseshoe distri-
e.g., [29]) is to assume that 1) the parameters, i l, l = 1, 2, f, L, butions, among others. To illustrate graphically the sparsity-pro-
are mutually statistically independent, 2) that each one of them moting property endowed by their heavy-tail nature, in addition
follows a Gaussian prior with zero mean, and 3) that the respec- to the Laplacian distribution plotted in Figure 4, we further depict
two representative GSM prior distributions, namely, Student’s t
distribution and the horseshoe distribution, in Figure 5. In
the “Sparsity-Aware Modeling for Bayesian DNNs” and “Spar-
Table 1. The examples of the GSM prior.
sity-Aware Modeling for Tensor Decompositions” sections, we
GSM Prior p (i l) Mixing Distribution p (g l) show the use of the GSM prior in modeling Bayesian NNs and
Student’s t Inverse gamma: low-rank tensor decompositions, respectively.
    p (g l; h p = [a, b]) = IG (g l; a, b) Besides the aforementioned families of sparsity-promoting
Normal–Jeffreys 1 priors, another path that has been followed to impose sparsity
Log uniform: p (g l; h p = [ ]) ?
gl
exploits the underlying property of the evidence function to
Laplacian Gamma: p (g l; h p = [a, b]) = Ga (g l; a, b) provide a tradeoff between the fitting accuracy and the model
Generalized hyperbolic Generalized inverse Gaussian: complexity, at its maximum value, as discussed in the “Bayesian
p (g l; h p = [a, b, m]) = GIG (g l; a, b, m)
Philosophy Basics” section. To this end, one imposes an indi-
vidual Gaussian prior N ^0, g lh on each one of the unknown
Horseshoe g l = x l y l, h p = [a, b]
Half Cauchy: p (x l) = C + (0, a) parameters, which are assumed to be mutually independent,
       p (y l) = C + (0, b)
and then treats the respective variances, g l, l = 1, 2, f, L, as
IG: inverse Gamma; Ga: Gamma; GIG: generalized inverse Gaussian; C+: half Cauchy. hyperparameters that are obtained via the evidence function

0.5
0.02
0.4
0.015
p (θ1, θ2)

0.3
p (θ1, θ2)

0.01 0.2

0.005 0.1

0 0
5 5
5 5
0 0
θ1 0 θ1 0
θ2 θ2
–5 –5 –5
5 –5

(a) (b)

FIGURE 5. The representative GSM prior distributions in 2D space. (a) The Student’s t distribution and (b) the horseshoe distribution. It can be seen that
these two distributions show different heavy-tail profiles and are both sparsity promoting.

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 27


optimization. Due to the accuracy–complexity tradeoff, the vari- the rationale of the IBP, the first dimension, say, x 1, is linked to
ances of the parameters that need to be pushed to zero (i.e., they some of the infinite nodes with certain probabilities. Then, the
do not contribute much to the accuracy–likelihood term) get very second dimension, say, x 2, is linked to some of the previously
large values, and their corresponding means approach to zero; linked nodes and to some new ones according to certain prob-
see, e.g., [19] and [30], where a theoretical justification is pro- abilities. This reasoning carries on until the last dimension of the
vided. The key point here is that allowing the parameters to vary input vector, x L, has been considered. As we see soon, the IBP
independently, with different variances, unveils the specific rele- is a sparsity-promoting prior because out of the infinitely many
vance of every individual parameter to the observed data, and the nodes, only a small number of those is probabilistically selected.
“irrelevant” ones are pushed to zero with high probability. Such In a more formal way, we adopt a binary random variable,
methods are also known as automatic relevance determination z ij ! " 0, 1 ,, i = 1, 2, f, L, and j = 1, 2, f. If z ij = 1, the ith
(ARD). In the “Sparsity-Aware Modeling for GPs” section, we customer (the ith dimension) selects the jth dish (linked to the
demonstrate the use of the ARD philosophy for designing recent jth node). On the contrary, if z ij = 0, the dish is not selected (the
sparse kernels. ith dimension is not linked to the jth node). The binary matrix Z
that is defined from the elements z ij is an infinite dimensional
Remark 1 one, and the IBP is a prior that promotes zeros in such binary
In practice, the choice of a specific prior depends on the tradeoff matrices.
between the expressive power of the prior and the difficulty of One way to implement the IBP is via the so-called stick-
inference. As shown in Table 1, advanced sparsity-promoting breaking construction. The goal is to populate an infinite binary
priors, e.g., the generalized hyperbolic prior and the horseshoe matrix, Z, with each element being zero or one. To this end, we
prior, come with more complicated mathematical expressions. first generate hierarchically a sequence of, theoretically, infinite
These give the priors good flexibility to adapt to different levels probability values, r j, j = 1, 2, f. To achieve this, the beta dis-
of sparsity, while they also pose difficulty in deriving efficient tribution is mobilized. The beta distribution is defined in terms
inference algorithms. Typically, when the noise power is known of two parameters. For the IBP, we fix one of them to be equal
to be small and/or the side information about the sparsity level to one, and the other one, a, is left as a (hyper)parameter, which
is available, sparsity-promoting priors with simple mathematical can either be preselected or learned during training. Then, the
forms, e.g., Student’s t prior, are recommended. Otherwise, one following steps are in order:
might consider the adoption of more complex members in the j
family of GSM priors; see, e.g., [6] and [10]. u j + Beta (u j ; a, 1), r j = % u l, j = 1, 2, f, (28)
l =1
SAL via Bayesian methods for nonparametric models
In this section, we turn our focus to nonparametric models, where, the notation “+” indicates the sample drawn from a dis-
where the number of involved parameters in an adopted model tribution.
is not considered to be known and fixed by the user but, in con- Then, the generated probabilities, rj, are used to populate the
trast, has to be learned from the data during training. A common matrix Z by drawing samples from a Bernoulli distribution that
path in this direction is to assume that the involved number of generates a one with probability rj and a zero with probability
parameters is infinite (practically, a very large number) and then 1 - r j as
leave it to the algorithm to recover a finite set of parameters out
z ij + Bernoulli ^ z ij ; r j h, (29)
of the, initially, infinite ones. To this end, one has to involve pri-
ors that deal with infinitely many parameters. The “true” number for each i = 1, 2, f, L, as illustrated in Figure 6. The beta distri-
of parameters is recovered via the associated posteriors. bution generates numbers between [0, 1], and from the preced-
ing construction, it is obvious that the sequence of probabilities
Indian buffet process prior " r j , goes rapidly to zero due to the product of quantities " u l ,
We introduce the Indian buffet process (IBP) [31] in a general being less than one in magnitude. How fast this takes place is
formulation, and then we see how to adapt it to fit our needs controlled by a, which is known as the innovation or strength
in the context of designing DNNs. Let us first assume that an parameter of the IBP; see, e.g., [1] for more discussion.
Indian restaurant offers a very large number, K, of dishes and let
K " 3. There are L customers. The first one selects a number The art of prior: Sparsity-aware modeling
of dishes with some probability. The second customer selects for three case studies
some of the previously selected dishes with some probability In the previous section, we introduced the indispensable ingre-
and some new ones with another probability and so on until all dients for obtaining sparsity-aware modeling under the Bayes-
L customers have been considered. In the context of designing ian learning framework, namely, the priors. In this section, we
DNNs, customers are replaced by the dimensions of the input (to demonstrate how these priors can be incorporated into some
each one of the layers) vector and the infinitely many dishes by popular data modeling and analysis tools to achieve sparsity-
the number of nodes in a layer. Since we have assumed that the promoting properties. Concretely, we introduce sparsity-aware
architecture is unknown, that is, the number of nodes (neurons) modeling for Bayesian DNNs in the “Sparsity-Aware Modeling
in a layer, we consider infinitely many of those. Then, following for Bayesian DNNs” section, for GPs in the “Sparsity-Aware

28 IEEE SIGNAL PROCESSING MAGAZINE | November 2022 |


Modeling for GPs” section, and for tensor decompositions in the reduced. Of course, this is another name for what we have called
“Sparsity-Aware Modeling for Tensor Decompositions” section. “sparsification” of the network. Over the years, a number of
rather ad hoc techniques have been proposed; see, e.g., [1] and
Sparsity-aware modeling for Bayesian DNNs [19] for a review. In the most recent years, Bayesian techniques
Our focus in this section is to deal with sparsity-promoting tech- have been employed in a more formal and theoretically pleasing
niques to prune DNNs, that is, starting from a network with a way. These techniques are our interest in this article. We focus
large number of nodes, to optimally remove nodes and/or links. on vanilla deep fully connected NNs; however, such techniques
We follow both paths, namely, the parametric one via the GSM have been extended and can be used for the case of, e.g., convo-
priors and the nonparametric one via the IBP prior. lutional networks [2], [3].
The name deep fully connected networks stresses that each
Fundamentals of DNNs node in any of the layers is directly connected to every node of the
NNs are learning machines that consist of a large number of neu- previous layer. To state it in a more formal way, without loss of
rons, which are connected in a layer-wise fashion. After 2010 generality, we consider a deep fully connected network consist-
or so, NNs with many (more than three) hidden layers, known ing of F layers. The number of nodes in the fth (1 # f # F - 1)
as DNNs, dominated the field of machine learning, due to their layer is a f (note that f in a f stands for the fth layer and acts as
remarkable representation power and outstanding prediction a superscript; it does not denote a to the power of f). For the ith
performance for various learning tasks. Since their introduction, node in the fth layer and the jth node in the ( f + 1) th layer, the
f
one of the major tasks associated with their design has been the link between them has a weight w ij, as illustrated in Figure 7. The
so-called pruning, that is, removing redundant nodes and links input vector to the ( f + 1) th layer consists of the outputs from
the previous layer, denoted as y f = 8y 1 , y 2 , f, y a f B , where
f f f T
so that their size and, hence, number of involved parameters is

Stick-Breaking Construction
π1 = u1
π2 = u1u2
π3 = u1u2u3

πj = ul, j = 1, 2, . . .
l=1 {0, 1} K→∞
...
...

α uj πj zij L ...
...
...

(a) 1 0 (b)

FIGURE 6. The implementation of the IBP via the stick-breaking construction. (a) The beta–Bernoulli model. (b) The binary matrix Z.

y1f

f
w12

y2f f
w22
...
. . .

. . .

3
f
w32 y2f + 1 = g yif w f
i2
i =1
...
y3f

(a) (b) (c) (d)

FIGURE 7. A deep fully connected NN. The (a) first layer, (b) fth layer, (c) ( f + 1) th layer, and (d) Fth layer.

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 29


y i is the output at the ith node. Here, w jf = 8w 1f j, w 2f j, f, w af f jB
f T

f
p (w ij) = # N (w ijf; 0, g ijf) p (g ijf h;) dg ijf (31)
denotes the vector that collects all the link weights associated
with the jth node. Then, the output of the jth node is
f
in which each functional form of p (g ij; h) in Table 1 corre-
= g e / w ij y i o = g `6w j @ y f j (30)
af
f+1 f f f T sponds to a GSM prior. In particular, the Normal–Jeffreys prior
yj
i =1 and the horseshoe prior were used in [5] and [6]. Next, we show
how to conduct node-wise sparsity-aware modeling (for all the
where g ($) is a nonlinear transformation function (also called an weights connected to that node). Inspired by the idea reported in
f f+1
activation function), and the most widely used ones include the [32], we group the weights {w ij} aj = 1 connected to the ith node
f
rectified linear unit (ReLU) function, the sigmoid function, and and assign a common scale parameter g i to their GSM priors;
f f
the hyperbolic tanh function. i.e., g ij = g i , 6j. Then we have the prior modeling for the ith
f f+1
node related weights {w ij} aj = 1:
Sparsity-aware modeling using GSM priors
# p ` {w ij} aj = 1, g i j dg i
f f+1 f f+1 f f
The basic idea of this approach can be traced back to the pioneer- p ({w ij} aj = 1) =
p ` {w ij} aj = 1 g ij j p ^g i ; hh dg i 
ing work [32] of MacKay, in 1995. He pointed out that for an
# f f+1 f f f
=
NN with a single hidden layer, the weights, each associated with
a f+1
a link between two nodes, can be treated as random variables. = # % N ^w ijf; 0, g if hp^g if ; hhdg if . (32)
The connection weights are associated with zero-mean Gaussian j=1
priors, typically with a shared variance hyperparameter. Then,
appropriate (e.g., Gaussian) posteriors are learned over the con- Furthermore, assuming that the nodes in the fth layer are mutual-
nection weights, which can be used for inference at test time. ly independent, we obtain the prior modeling for all the weights
f f f+1
The variance hyperparameters of the imposed Gaussian priors {{w ij} ai = 1} aj = 1 forwarded from the fth layer:
can be selected to be low enough that the corresponding connec-
af
tion weights exhibit a priori the tendency of being concentrated
% p ({w ijf} aj = 1)
f f f+1 f+1
p ({{w ij} ai = 1} aj = 1) =
around the postulated zero mean. This induces a sparsity “bias” i=1
to the network. The major differences between recent works [5], af a f+1

[6] and the early work [32] lies in their adopted priors. = % # % N (w ijf; 0, g if ) p (g if ; h) dg if . (33)
i=1 j=1
Let us consider a network with multiple hidden layers [5],
f f+1
[6], as depicted in Figure 8. For the ith node in the fth layer and By this modeling strategy, the weights {w ij} aj = 1 related to
f f
the jth node in the (f + 1) th layer, their link has a weight w ij . the ith node are tied together in the sense that when g i (a single
f f+1
For each of the random weights, we can adopt a sparsity-pro- scalar value) goes to zero, the associated weights {w ij} aj = 1 all
moting GSM prior so that become negligible. This makes the ith node in the fth layer

f th Layer ( f + 1)th Layer f th Layer ( f + 1)th Layer f th Layer ( f + 1)th Layer

... ... ...

... ... ...


ith Node ith Node ith Node
...
... Can Be
...
Removed

Consider All the Links


the ith Node Are Inactive
When the Common
... Scale Parameter
wi1f f
ζi → 0
wi2f Modeling
the Weights
wif(a f+1 – 1) ... Via
ith Node Tied Together Via
... GSM Priors
the Same Scale
wiaf f+1 Parameter ζi
f

FIGURE 8. The node-wise sparsity-aware modeling for DNNs using GSM priors.

30 IEEE SIGNAL PROCESSING MAGAZINE | November 2022 |


disconnected from the (f + 1) th layer and thus blocks the infor- The stick-breaking construction for the IBP prior was utilized
mation flow. Together with the sparsity-promoting nature of since it turns out to be readily amenable to VI. This is a desir-
GSM priors, the prior derived in (33) inclines a lot of nodes to able property that facilitates both training and inference through
be removed from the DNN without degrading the data fitting recent advances in black-box VI, namely, stochastic gradient
performance. This leads to node-wise sparsity-aware modeling variational Bayes, as explained in the “Inference Algorithms for
for deep fully connected NNs. Of course, as is always the case Bayesian DNNs” section. For each i, the considered hierarchical
with Bayesian learning, the goal is to learn the corresponding construction reads as follows:
posterior distributions, and node removal is based on the learned j
values for the respective means and variances.
f f
u j ~Beta (u j | a, 1), r j =
f
% u lf , f f f
z ij ~Bernoulli (z ij | r j ).(34)
l=1

Sparsity-aware modeling using IBP prior During training, posterior estimates of the respective prob-
The previous approach to imposing sparsity inherits a major abilities are obtained, which then allow for a naturally arising
drawback that is shared by all techniques for designing DNNs. component omission (link pruning) mechanism by introducing
That is, the number of nodes per layer has to be specified and a cutoff threshold x ; any link/weight with an inferred posteri-
preselected. Of course, one may say that we can choose a very or below this threshold value is deemed unnecessary and can
large number of nodes and then harness “sparsity” to prune the be safely omitted from computations. This inherent capability
network. However, if one overdoes it, he/she soon runs into prob- renders the considered approach a fully automatic data-driven
lems due to overparameterization. In contrast, we now turn our principled paradigm for SAL based on explicit inference of
attention to nonparametric techniques. We assume that the nodes component utility based on dedicated latent variables.
per layer are theoretically infinite (in practice, a large enough By utilizing the aforementioned construction, we can easily
number) and then use the IBP prior to enforce sparsity (Figure 9). incorporate the IBP mechanism in conventional ReLU-based
In line with what has been said while introducing the IBP networks and perform inference. However, the flexibility of
(the “Indian Buffet Process Prior” section), we multiply each the link-wise formulation allows us to go one step further. In
f
weight, i.e., w ij, with a corresponding auxiliary (hidden) binary recent works, the stick-breaking IBP prior has been employed in
f
random variable, z ij . The required priors for these variables, conjunction with a radically different, biologically inspired, and
f af a f+1
{{z ij} i = 1} j = 1, are generated via the IBP prior. In particular, competition-based activation, namely, the stochastic local win-
f f+1
we define a binary matrix Z f ! R a # a , with its (ij)th element ner takes all (LWTA) [2], [3], [4]. In the general LWTA context,
f
being z ij for the fth layer. Due to the sparsity-promoting nature neurons in a conventional hidden layer are replaced by LWTA
of the IBP prior, most elements in Z f tend to be zero, nulling blocks consisting of competing linear units. In other words, each
f f f+1
the corresponding weights in {{w ij} ai = 1} aj = 1, due to the involved node includes a set of linear (inner product) units. When pre-
multiplication. This leads to an alternative sparsity-promoting sented with an input, each unit in each block computes its acti-
modeling for DNNs [2], [3]. vation; the unit with the strongest activation is deemed to be the

f th Layer ( f + 1)th Layer f th Layer ( f + 1)th Layer

... ...

... ...
i th Node ith Node
... ...
j th Node jth Node

Consider The Link i → j


the Link i → j Can Be Removed

When the f
f f Binary Variable Modeling { zij }
wij × zij
f Via IBP Prior
zij = 0

FIGURE 9. The link-wise sparsity-aware modeling for DNNs using the IBP prior.

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 31


winner and passes its output to the next layer, while the rest are In the stochastic LWTA, all pls are treated as binary random
restricted to silence, i.e., the zero value. This is how nonlinearity variables. The respective probabilities, which control the firing
is achieved. (corresponding p = 1) of each linear unit within a single LWTA,
This deterministic winner selection, known as the hard are computed via a softmax type of operation (see, e.g., [1] and
LWTA, is the standard form of an LWTA. However, in [2], a [33]); that is,
new variant was proposed to replace the hard LWTA by a novel L
exp (h nkj)
stochastic adaptation of the competition mechanism implement- Pnkj = J , h nkj = / (z ik w ikj) x ni .(36)
ed via a competitive random sampling procedure founded on / exp (h nkj) i=1
j=1
Bayesian arguments. To be more specific, let a layer in the NN
have a f = L inputs, i.e., x i, i = 1, 2, L, where we use x to denote Note that in the preceding equation, the firing probability of a lin-
the input to any layer, simplifying the discussion. Also, assume ear unit depends on both the input, x n, and on whether the link to
that the number of LWTA blocks in the layer is a f + 1 = K. We the corresponding LWTA block is active or not (determined by
also relax the notation on the number of layer f, and our analysis the value of the corresponding utility variable z ik). Basically, the
refers to any node of any layer. Each LWTA block includes J stochastic LWTA introduces a lateral competition among units
linear units, each one associated with a corresponding weight, in the same layer. How the w’s as well as the c­ orresponding util-
w ikj, i = 1, 2, fL, k = 1, 2, f, K, and j = 1, 2f, J. Consider ity binary variables are learned is provided in the “Inference Al-
the kth LWTA block. We introduce an auxiliary latent variable, gorithms for Bayesian DNNs” section. A graphical illustration
p kj , and the output of the corresponding jth linear unit in the kth of the considered approach appears in Figure 10. Note that as the
block is given by input changes, a different subnetwork, via different connected
L J links, may be followed to pass the input information to the out-
y kj = p kj w Tkj x = p kj / w ikj x i, p kj ! {0, 1}, / p kj = 1.(35) put with high probability. This is how nonlinearity is achieved in
i=1 j=1
the context of the stochastic LWTA blocks.
In other words, the outputs of the linear units are either
the respective inner product between the input vector and the Sparsity-aware modeling for GPs
associated weight vector or zero, depending on the value of We discussed, in the “Bayesian Nonlinear Nonparametric
p kj , which can be either zero or one. Furthermore, only one of Model: GP Regression Example” section, that the kernel func-
the ps in a block can be one, and the rest are zero. Thus, we tion determines, to a large extent, the expressive power of the
can associate with each LWTA block a vector p k ! R J , with GP model. More specifically, the kernel function profoundly
only one of its elements being one and the rest being zero; see controls the characteristics (e.g., smoothness and periodicity)
Figure 10. In machine learning jargon, this is known as the one- of a GP. To provide a kernel function with more expressive
hot vector and can be denoted as p k ! one_hot (J ) . If we stack power and that is adaptive to any given dataset, one way is
together all the p k, k = 1, 2, f, K for the specific layer, we can to expand the kernel function as a linear combination of Q
write p ! one_hot (J ) K . subkernels/basis kernels; i.e.,

Input Layer
IBP and LWTA IBP and LWTA
Output Layer
k (x, xl ) = / a i k i (x, xl ) (37)
Layer Layer i=1

1
ξ=1 1 where the weights, a i, i = 1, 2, f, and Q,
1
z11 =1 can either be set manually or be opti-
x1
mized. Each one of these subkernels can
1
z11 =1 be any one of the known kernels or any
ξ=0
function that admits the properties that
define a kernel; see, e.g., [19]. One may
consider constructing such a kernel ei-
ther in the original input domain or in the
frequency domain. The most straightfor-
ward way is to linearly combine a set of
ξ=0 elementary kernels, such as the SE kernel,
z1a1K = 0
xa1 rational quadratic kernel, periodic kernel,
and so on, with varying kernel hyperpa-
z1a1K = 0
K ξ=1 rameters in the original input domain; see,
K
e.g., [24], [25], and [34]. For high-dimen-
sional inputs, one can first detect pairwise
FIGURE 10. The LWTA and IBP-based architecture. Bold edges denote active (effective) connections
(with z ikf = 1). Nodes with bold contours denote winner units; i.e., they correspond to p = 1 (we do
interactions, using a fast and generic one
not use p kjf, to unclutter the notation). Rectangles denote LWTA blocks. For simplicity, each LWTA developed recently in [35], between the
block includes two (J = 2) competing linear units for k = 1, 2, f, K. inputs and for each interaction pair, adopt

32 IEEE SIGNAL PROCESSING MAGAZINE | November 2022 |


an elementary kernel or an advanced deep kernel [16]. Such Q

a resulting kernel belongs essentially to the analysis-of- ht = argmaxlog N ( y; 0, b


h
-1
I+ / a i K i (X, X))
i=1
variance family, as surveyed in [1], which has a hierarchical Q
structure and good interpretability. Alternatively, one may / argmin
h
) logdet e b -1 I + / a i K i (X, X) o 
i=1
perform optimal kernel design in the frequency domain by
Q -1
+ y T e b -1 I + / a i K i (X, X) o y3
using the idea of sparse spectrum kernel representation. Due
(38)
to its solid theoretical foundation, in this article, we focus on i=1
the sparse spectrum kernel representation and review some
Q
representative works, such as [7], [36], and [37], at the end / arg min ) logdet e b -1 I + / a i U i (X) U Ti (X) o
of this section. h
i=1
Q -1 
Rationale behind sparsity awareness + y T e b -1 I + / a i Ui (X) U Ti (X) o y3 (39)
i=1
The corresponding GP model with the kernel form in (37) can
be regarded as a linearly weighted sum of Q independent GPs.
In other words, we can assume that the underlying function takes where h = [a T , b] T , (38), in the second line, corresponds to
Q
the form f (x) = R i = 1 fi (x), where fi (x) ~GP (0, a i k i (x, xl )), the original GP model and (39), in the third line, corresponds
for i = 1, 2, g, Q. In practice, Q is selected to be a large value to the equivalent Bayesian linear model mentioned in the pre-
compared to the “true” number of the underlying effective com- ceding. Note that K i (X, X ) represents the N × N kernel ma-
ponents that generated the data, whose exact value is not known. trix of k i (x, xl ) evaluated for all the training input pairs, while
In the following, one can mobilize the ARD philosophy (see the U i (X ) _ [z i (x 1), z i (x 2), f, z i (x N )] T , of size N # Ll , contains
“SAL via Bayesian Methods for Parametric Models” section), the explicit mapping vectors evaluated at the training data. In the
during the evidence function optimization, to drive all unneces- following, they are denoted as K i and U i for brevity.
sary subkernels to zero, namely, promoting sparsity on a. To Mathematical proof of the sparsity property follows that of
this end, let us first establish a bridge between the nonparametric the RVM [41] for the classic sparse linear model. Let us focus
GP model and the Bayesian linear regression model that was on the last expression in (39), which involves both the log
considered in the “Bayesian Linear Parametric Regression: A determinant and the inverse of the overall covariance matrix,
Pedagogic Example” section. C, and mathematically,
From the theory of kernels (see, e.g., [1], [19], and [38]),
Q
each one of the subkernel functions can be written as the inner C = b -1 I + / am Um Um + ai Ui Ui
T T
product of the corresponding feature mapping function, name- m = 1, m ! i 
ly, k i (x, xl ) = z iT (x) z i (xl ), where z i (x): R L 7 R Ll and it = C -i + ai Ui Ui
T
(40)
is often assumed that Ll & L. As a matter of fact, the feature
mapping function results by fixing one of the arguments of the where we have separated the ith subkernel from the rest.
kernel function and making it a function of a single argument; For clarity, let us focus on the kernel hyperparameters,
T
i.e., z i (x) = k i (x, ·), where “·” denotes the free variable(s) of a = [a 1, a 2, f, a Q] ; namely, we regard b as known and remove
the function and is filled by xl . In general, z i (x) is a function. it from h. Applying the classic matrix identities [24] and insert-
However, in practice, if needed, this can be approximated by a ing the results back into (39), we get L (a) = L (a -i) + c (a i),
very high-dimensional vector constructed via the famous ran- where L (a -i) is simply the evidence function with the ith sub-
dom Fourier feature approximation [1], [39]. Then, each inde- kernel removed, and the newly introduced quantity
pendent GP process, fi (x) ~GP (0, a i k i (x, xl )), by mobilizing
c ^a ih _ -
the definition of the covariance matrix, can be equivalently 1 ln (a ) - 1 log I + a U < C -1 U
i i i -i i
2 2
interpreted as fi (x) _ i Ti z i (x), where the weights, i i, of size 
+ 1 y < C --i1 U i ^a i-1 I + U i< C --i1 U i h U <i C --1i y.
-1
Ll # 1, are assumed to follow a zero-mean Gaussian distribu- (41)
2
tion; i.e., i i ~N (0, a i I ) . Therefore, one can alternatively write
Q
f (x) = R i = 1 i Ti z i (x), where i i and i j are assumed to be It is not difficult to verify that the evidence maximization
mutually independent for i ! j. Essentially, a GP with such a problem in (39) boils down to maximizing the c ^a i h when fix-
kernel c­ onfiguration is a special case of the more general sparse ing the rest of the parameters to their previous estimates.
linear model family, which can also incorporate, apart from a This means that we can solve for the hyperparameters in a
Gaussian prior, heavy-tailed priors to promote sparsity, such as sequential manner. Taking the gradient of c ^a i h with respect
those surveyed in the “SAL via Bayesian Methods for Paramet- to the scalar parameter a i and setting it to zero gives the
ric M­ odels” section. A more detailed presentation of the sparse global maximum, as can be verified by its second-order
linear models can be found in some early references, such as derivative. Interestingly, the solution to a i is either zero or a
[30] and [40]. positive value, mainly depending on the relevance between
As we mentioned before, the GP model hyperparameters can the ith subkernel function and the observed data [19]. Only
be optimized through maximizing the logarithm of the evidence if their relevance is high enough will a i take a nonzero posi-
function, L (h) _ log p (y; h), and using (21), we obtain tive value. This explains the sparsity-promoting rationale

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 33


behind the method. In the “The Art of Inference: Evidence additional flexibility of the kernel obtained through optimization
Maximization and Variational Approximation” section, we pro- can significantly improve the fitting performance, as it enables
vide an advanced numerical method for solving the hyperparam- automatic learning of the best kernel function for any specific
eters that maximize the evidence function. problem. The resulting spectral density of (43) is a set of sparse
Dirac deltas for approximating the underlying spectral density.
Sparse spectrum kernel representation In [37], the Dirac deltas are replaced with a mixture of
At the beginning of this section, we expressed kernel expansion Gaussian basis functions in the frequency domain, leading
in terms of a number of subkernels and introduced two major to the so-called spectral mixture (SM) kernel. The SM kernel
paths (either in the original input domain or in the frequency can approximate any stationary kernel arbitrarily well in the
domain). In the following, we turn our focus to the frequency , 1-norm sense, due to Wiener’s theorem of approximation [42].
domain representation of a kernel function and on techniques Concretely, the underlying spectral density is approximated by
that promote sparsity in the frequency domain, leading to the a Gaussian mixture as
sparse spectrum kernel representation. To start with, it is as- Q
) exp = G
- (~ - n i) 2
sumed that the underlying function has only a few effective S (~) = 1 / ai
2 i=1 2rv i2 2v i2
frequency bands/points in the real physical world. Second, the 
+ exp = G3
kernel function takes a linearly weighted sum of basis func- - (~ + n i) 2
(44)
tions, similar to the ARD method for linear parametric mod- 2v 2i
els; thus, only a small number of the functions are supposed
to be relevant to the given data, from the algorithmic point of where Q is a fixed number of mixture components and a i, n i,
view. Sparse solutions can be obtained from maximizing the and v 2i are the weight, mean, and variance parameters of the
associated evidence function, as introduced in the “The Art of ith mixture component, respectively. It is noteworthy that the
Inference: Evidence Maximization and Variational Approxima- sum of the two exponential functions on the right-hand-side of
tion” section. For the ease of narration, we constrain ourselves (44) ensure the symmetry of the spectral density. For illustration
to 1D input space, namely, x ! R 1, but the idea can be easily purposes, we draw the comparison between the original sparse
extended to the multidimensional input space. Often, we have x spectrum kernel [36] and the SM kernel [37] in Figure 11.
= t for 1D time series modeling. Taking the inverse Fourier transform of S (~) yields a sta-
The earliest sparse spectrum kernel representation was pro- tionary kernel in the time domain, as follows:
posed in [36] and developed upon a Bayesian linear regression Q

model with trigonometric basis functions; namely, k (t, t l ; h p) = k (x; h p) = / a i exp 6- 2r 2 x 2 v i2@ cos (2rxn i)
i=1
 (45)
Q
f (x) = / a i cos (2r~ i x) + b i sin (2r~ i x) (42)
i=1 where h p = [a 1, a 2, f, a Q, n 1, n 2, f, n Q, v 21, v 22, f, v Q2 ]T
denotes the hyperparameters of the SM kernel to be optimized
where {cos (2r~ i x), sin (2r~ i x)} constitute one pair of basis and x _ x - xl , owing to the stationary assumption. For accurate
functions parameterized in terms of the center frequencies ~ i, approximation, however, we need to choose a large Q, which po-
i ! {1, 2, f, Q}, and the random weights, a i and b i, are inde- tentially leads to an overparameterized model with many redun-
pendent and follow the same Gaussian distribution, N (0, v 20 /Q) dant localized Gaussian components. Besides, optimizing the
(it is noteworthy that ~ ! [0, 1/2) represents a normalized fre- frequency and variance parameters is numerically difficult as a
quency, namely, the physical frequency over the sampling fre- nonconvex problem and often incurs bad local minima.
quency). Under such assumptions, f(x) can be regarded as a GP, To remedy the aforementioned numerical issue, in [7],
according to the “Bayesian Nonlinear Nonparametric Model: it was proposed to fix the frequency and variance param-
GP Regression Example” section, and the corresponding covari- eters, n 1, n 2, f, n Q, v 12, and v 22, f, v Q2 , in the original SM
ance/kernel function can be easily derived as kernel to some known grids and focus solely on the weight
parameters, a 1, a 2, f, a Q . The resulting kernel is called
2 2 Q
the grid SM (GridSM) kernel. By fixing the frequency and
k (x, xl ) =
v0
Q
T
z (x) z (x l ) =
v0
Q
/ cos (2r~ i (x - xl )) (43) variance parameters, the preceding GridSM kernel can be
i=1
regarded as a linear multiple kernel with Q basis subkernels,
where the feature mapping vector z (x) contains all Q pairs of k i (x) _ exp 6- 2r 2 x 2 v i2@ cos (2rxn i), and i = 1, 2, f, Q. In [7],
trigonometric basis functions. it was shown that for sufficiently small variance, each subker-
Usually, we favor a large value of Q, well exceeding the nel matrix has a low rank smaller than N/2, namely, half of the
expected number of effective components. If the frequency data size. Therefore, it falls under the formulation in (38). The
points are randomly sampled from the underlying spectral den- corresponding weight parameters of such an overparameterized
sity, denoted by Su (~), then (43) is equivalent to the random Fou- kernel can be obtained effectively via optimizing the evidence
rier feature approximation of a stationary kernel function [39]. function in (38), and the solution turns out to be sparse, as dem-
However, in [36], the center frequencies are optimized through onstrated in the “The Art of Inference: Evidence Maximization
maximizing the evidence function. As claimed in the paper, such and Variational Approximation” section.

34 IEEE SIGNAL PROCESSING MAGAZINE | November 2022 |


Sparsity-aware modeling for tensor decompositions R

In the previous sections, we elucidated the sparsity-aware mod- D= / 1a4444


( 1) ( 2) (P)
r % ar % g % ar
244443
r=1 rank-1 tensor 
eling for two recent supervised data analysis tools, namely,
DNNs and GPs. The underlying idea of employing an overpa- _ " A , A , f, A ,
( 1) ( 2) (P)
(46)
rameterized model and embedding sparsity via an appropriate
(p)
prior has also inspired recent sparsity-aware modeling for unsu- where % denotes vector outer product and A (p) _ [a 1 ,
(p) (p )
pervised learning tools in the context of tensor decomposition; a 2 , f, a R ] ! R J p # R, 6p is called the factor matrix. The
see, e.g., [10], [11], [12], [13], [14], and [15]. For pedagogical smallest R that yields the preceding expression is termed the ten-
purposes, we first introduce the basics of tensors and tensor ca- sor rank.
nonical polyadic decomposition (CPD), the most fundamental From this definition, it is readily seen that tensor CPD is a
tensor decomposition model in unsupervised learning. multidimensional generalization of a matrix decomposition in
terms of rank-1 representation. In particular, when P = 2, (46)
Tensors and CPD reduces to decomposing a matrix D ! R J 1 # J 2 into the summa-
Tensors are regarded as multidimensional generalizations of tion of R rank 1 matrices; i.e., D = R rR= 1 a (r1) % a (r2) . By defining
matrices, thus providing a natural representation for any mul- the term a (r1) % a (r2) % g % a (rP) as a P-D rank-1 tensor, CPD essen-
tidimensional dataset. Specifically, a P-dimensional (P-D) tially seeks R rank-1 tensors/components from the observed
dataset can be represented by a P-D tensor D ! R J 1 # J 2 # gJ P dataset, each corresponding to one specific underlying source
[43]. Given a tensor-represented dataset D, the unsupervised signal. Thus, the tensor rank R has a clear physical meaning;
learning considered in this article aims to identify the under- namely, it corresponds to the number of underlying source sig-
lying source signals that generate the observed data. In dif- nals. Differing from matrix decomposition, where the rank 1
ferent fields, this task gets different names, such as clustering components are, in general, not unique, CPD for a P-D tensor
in social network analysis [44], blind source separation in (P 2 2) gives unique rank 1 components under mild conditions
electroencephalogram (EEG) and functional magnetic reso- [43]. The uniqueness endows the superior interpretability of the
nance imaging (fMRI) data analysis [45], [46], and blind sig- CPD model used in various unsupervised data analysis tasks.
nal estimation in radar/sonar signal processing [47]. In these
applications, tensor CPD has been proved to be a powerful Low-rank CPD and sparsity-aware modeling
tool with good interpretability. Formally, tensor CPD is de- In real-world data analysis, the number of underlying source sig-
fined as follows [43]. nals is usually small. For instance, in brain source imaging [45],
[46], both EEG and fMRI data analysis outcomes have shown
Definition of tensor CPD that only a small fraction of source signals contribute to the
Given a P-D tensor D ! R J 1 # J 2 # gJ P, CPD seeks to find the observed brain activities. This suggests that the assumed CPD
vectors {a (r1), a (r2), f, a (rP)} rR= 1 such that model should have a small tensor rank R to avoid data overfitting.

S(ω) S(ω)
140 250

120
200
100
Spectral Density

Spectral Density

150
80

60 100

40
50
20
ω ω
0 0
0 0.1 0.2 0.3 0.4 0.5 0 0.05 0.1 0.15 0.2 0.25 0.3
Normalized Frequency Normalized Frequency
(a) (b)

Underlying Spectral Density SM Kernel Sparse Spectrum Kernel

FIGURE 11. A comparison of the SM kernel and the original sparse spectrum kernel in (43) for approximating the underlying spectral density. Herein, the
SM kernel employs a mixture of Gaussian basis functions (see the blue curves), while the original sparse spectrum kernel employs a mixture of Dirac
deltas (see the red vertical lines). (a) Band-pass shape spectrum. (b) Low-pass shape spectrum.

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 35


In the following, we show how the low rankness is embedded if g l approaches zero, the elements in q l will shrink to zero
into the CPD model through practicing the ideas reported in the simultaneously, thus nulling a rank-1 tensor, as illustrated
previous two sections. First, we employ an overparameterized in Figure 12. Since the prior distribution given in (47) favors
model for CPD by assuming an upper-bound value L of tensor zero-valued rank-1 tensors, it promotes the low rankness of the
rank R; i.e., L & R. The low rankness implies that L – R rank-1 CPD model.
(p)
tensors should be zero, each specified by vectors {a l } Pp = 1, 6l.
P
(1) (2) (P)
In other words, let vector q l _ [a l ; a l ; f; a l ] ! R R p = 1 J p, 6l. Remark 2
The low rankness indicates that a number of vectors in the set If the factor matrices are further constrained to be nonnegative
{q l} lL= 1 are zero vectors. for enhanced interpretability in certain applications, simple mod-
(p)
To model such sparsity, we adopt the following multivariate ification, that is, multiplying a unit step function U (a l $ 0)
(p)
extension of the GSM prior introduced in the “SAL via Bayes- (which returns one when a l $ 0, or zero otherwise), to the pri-
ian Methods for Parametric Models” section; that is, or derived in (47) can be made to embed both the nonnegative-
ness and the low rankness; see the in-depth discussions in [11].
P
/ Jp
p=1

p (q l) = # % N ([q l] i; 0, g l) p (g l; h p) dg l Extensions to other tensor decomposition models


i=1 Similar ideas have been applied to other tensor decomposi-
= # N (q l; 0, g l I ) p (g l; h p) dg l  tion models, including Tucker decomposition (TuckerD) [48]
P and tensor train decomposition (TTD) [49], [50], [51]. In these
= # % N (a l(p); 0, g l I) p (g l; h p) dg l (47) works, one first assumes an overparametrized model by setting
p=1
the model configuration parameters (e.g., multilinear ranks in
TuckerD and TT ranks in TTD) to be large numbers and then
where [q l] i denotes the ith element of vector q l . Since the el- imposes the GSM prior on the associated model parameters to
ements in q l are assumed to be statistically independent, then control the model complexity; see the detailed discussions in
according to the definition of multivariate Gaussian distribu- [48], [49], [50], and [51].
tion, we have the second and third lines of (47), showing the
equivalent prior modeling on the concatenated vector q l and Remark 3
(p)
the associated set of vectors {a l } Pp = 1, respectively. The mix- Some further suggestions are given for choosing appropriate ten-
ing distribution p (g l; h p) can be any of those listed in Table 1. sor decomposition models for different data analysis tasks; see,
Note that in (47), the elements in vector q l are tied together via e.g., [52]. If the interpretability is crucial, one might try CPD
a common hyperparameter g l . Once the learning phase is over, (and its structured variants) first, due to its appealing uniqueness

= +...+ +...+ = +...+ +...+

3D X1 Xl XL 3D X1 Xl XL
Tensor Data First Rank-1 lth Rank-1 Tensor Lth Rank-1 Tensor Data First Rank-1 l th Rank-1 Tensor Lth Rank-1
X Tensor Tensor X Tensor Tensor
Denotes Zero

The Rank-1 Tensor


Becomes All Zero

Tied Together Via When the Common Modeling


ql = the Same Scale Scale Parameter the Column
Parameter ζl Elements
ζl → 0
Via
GSM Priors

FIGURE 12. The sparsity-aware modeling for rank-1 tensors using GSM priors.

36 IEEE SIGNAL PROCESSING MAGAZINE | November 2022 |


property. On the other hand, if the task is related to subspace where the lower bound
learning, TuckerD should be considered since its model param-
eters can be interpreted as the basis functions and the associated
coefficients. For missing data imputation, TTD is a good choice,
L (q (i); h) _ # q (i) log p (Dq (,ii); h) di (49)
as it disentangles different tensor dimensions. More concrete ex-
amples can be found in the recent overview paper [52]. is called the evidence lower bound (ELBO) and q (i) is known
as the variational distribution. The tightness of the ELBO is de-
The art of inference: Evidence maximization termined by the closeness between the variational distribution
and variational approximation q (i) and the posterior p M (i ; D; h), measured by the Kull-
Having introduced sparsity-promoting priors, we are now at the back–Leibler (KL) divergence, KL (q (i) < p M (i ; D; h)). In
stage of deriving the associated Bayesian inference algorithms other words, the ELBO becomes tight; i.e., the lower bound be-
that aim to learn both the posterior distributions of the unknown comes equal to the evidence when KL (q (i) < p M (i ; D; h)) = 0,
parameters/functions and the optimal configurations of the model which holds true if and only if q (i) = p M (i ; D; h). This is
hyperparameters. In the “Evidence Maximization Framework” easy to see if we expand (49) and reformulate it as
section, we first show that the inference algorithms developed for
our considered data analysis tools can be unified into a common log p M (D; h) = L (q (i); h) + KL (q (i) < p M (i ; D; h)). (50)
evidence maximization framework. Then, for each data analysis
tool, we further show how to leverage recent advances in varia- Since the KL divergence is nonnegative, the equality in (48)
tional approximation and nonconvex optimization to deal with holds if and only if it is equal to zero. Since the ELBO in (49)
specific problem structures for enhanced learning performance. involves two arguments, namely, q (i) and h, solving the maxi-
Concretely, we introduce inference algorithms for GP in the “In- mization problem
ference Algorithms for GP Regression” section, for tensor de-
max L (q (i); h) (51)
compositions in the “Inference Algorithms for Bayesian Tensor q (i), h
Decompositions” section, and for Bayesian DNNs in the “Infer-
ence Algorithms for Bayesian DNNs” section. can provide both an estimate of the model hyperparameters and
the posterior distributions. These two terms can be optimized in
Evidence maximization framework an alternating fashion. Different strategies for optimizing q (i)
Given a data analysis task and having selected the learning and h result in different inference algorithms. For example, the
model, M, that is the associated likelihood function p M (D ; i) variational distribution q (i) can be optimized either via func-
and a sparsity-promoting prior p M (i; h p), the goal of Bayes- tional optimization [59] or the Monte Carlo method [60], while
ian SAL is to infer the posterior distribution p M (i ; D; h) by the hyperparameters h can be optimized via various nonconvex
using Bayes’ theorem, given in (1), and to compute the model optimization methods [53], [54], [55], [61], [62]. In the follow-
hyperparameters h by maximizing the evidence p M (D; h). We ing sections, we introduce some inference algorithms designed
differentiate the following two cases of the evidence function. specifically for the three popular data analysis tools introduced
First, if the evidence p M (D; h) defined in (5) can be derived in the “The Art of Prior: Sparsity-Aware Modeling for Three
analytically, such as (11) in the Bayesian linear regression exam- Case Studies” section that have been equipped with certain spar-
ple, the model hyperparameters h can be learned via solving the sity-promoting priors.
evidence maximization problem, for which advanced noncon-
vex optimization tools, e.g., [53], [54], [55], [56], and [57], can Inference algorithms for GP regression
be utilized to find high-quality solutions. In this case, since the Let us start with the GP model for regression because, in this
prior, likelihood, and evidence all have analytical expressions, case, the evidence function p M (D; h) can be derived ana-
applying Bayes’ theorem (1) yields a closed-form posterior dis- lytically owing to the Gaussian prior and likelihood assumed
tribution of the unknown parameters. throughout the modeling process. In this section, we introduce
Unfortunately, in most cases, the multiple integration required an effective inference algorithm for GP regression based on the
in computing the evidence (5) turns out to be analytically intrac- linear multiple kernel in (37). In the “The Art of Prior: Sparsity-
table. Inspired by the ideas of the minorize-maximization [also Aware Modeling for Three Case Studies” section, we already
called majorization–minimization (MM)] optimization frame- derived the logarithm of the evidence function in analytical
work [56], we can seek a tractable lower bound (or a valid sur- form, as shown in (38). Therefore, we can optimize it directly to
rogate function, in general) that minorizes the evidence function obtain an estimate of the model hyperparameters h.
and maximize the lower bound iteratively until convergence. It Traditionally, one could estimate the weights of the subker-
has been shown (see, e.g., [1], [19], and [58]) that such an opti- nels, a i, i = 1, 2, f, Q, as well as the precision parameter, b,
mization process can push the evidence function to a stationary using an iterative algorithm similar to the one derived in [41]. In
point. More concretely, the logarithm of the evidence function is particular, one sequentially solves for a i, i = 1, 2, f, Q, from
lower bounded as follows: the equation c (a i) = 0 derived in (41), by fixing the rest of the
weights to their latest estimate and then checks its relevance
log p M (D; h) $ L (q (i); h) (48) with the data in each iteration. This iterative method works quite

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 37


In the following, however, we introduce a numerical algorithm that is potentially more effective than the original one [30] in terms of both its sensitivity to the initial guess and its data fitting performance.

Next, we take the GridSM kernel in (45) as an example of the linear multiple kernel and rewrite the evidence maximization problem as

\eta^{\star} = \arg\min_{\eta} \; l(\eta) \triangleq g(\eta) - h(\eta)  (52)

where η = [α^T; β]^T, with α ≥ 0 and β > 0, g(η) ≜ y^T C^{-1}(η) y, and h(η) ≜ −log det(C(η)). Let us introduce the short notation C(η) ≜ Σ_{i=1}^{Q} α_i K_i + β^{-1} I, where K_i represents the ith subkernel matrix evaluated with the training inputs. It can be shown that g(η) and h(η): H → R are both convex and differentiable functions, with H being a convex set. Therefore, the cost function in (52) is a difference of two convex functions with respect to η, and the optimization problem belongs to the well-known class of difference-of-convex programs (DCPs) [63]. Instead of adopting the classic iterative procedure proposed for the RVM [41], we take advantage of the DCP optimization structure. Such a favorable structure may help to accelerate the convergence process and avoid a bad local minimum of the optimization problem and thus further improve the level of sparsity [7].

Sequential MM algorithm
The main idea is to solve min_{η∈H} l(η), with H ⊆ R^{Q+1}, through an iterative scheme, where in each iteration, a so-called majorization function \bar{l}(η, η_k) of l(η) at η_k ∈ H is minimized; i.e.,

\eta_{k+1} = \arg\min_{\eta \in \mathcal{H}} \; \bar{l}(\eta, \eta_k)  (53)

where the majorization function \bar{l}(·, ·): H × H → R satisfies \bar{l}(η, η) = l(η) for η ∈ H and l(η) ≤ \bar{l}(η, η′) for η, η′ ∈ H. We adopt the so-called linear majorization. Concretely, we make the convex function h(η) affine by performing a first-order Taylor expansion and obtain

\bar{l}(\eta, \eta_k) \triangleq g(\eta) - h(\eta_k) - \nabla_{\eta}^{T} h(\eta_k)(\eta - \eta_k).  (54)

In this way, minimizing the cost function in (53) becomes a convex optimization problem in each iteration. By fulfilling the regularity conditions, the MM method is guaranteed to converge to a stationary point [56].

Next, we show how (53), with the linear majorization in (54), can be solved. Since g(η) is a matrix fractional function, in each iteration, (53) actually solves a convex matrix fractional minimization problem [63], which is equivalent to a semidefinite programming problem via the Schur complement. This problem can be further cast into a second-order cone program that can be efficiently and reliably solved using off-the-shelf convex solvers, such as MOSEK [7]. Although this MM algorithm often leads to rather good solutions, it cannot guarantee a local minimum in all cases. Occasionally, we found that it provides less satisfactory results, and these can be significantly improved by using a novel nonlinearly constrained alternating direction method of multipliers (ADMM) algorithm, as proposed in [7]. In general, the ADMM algorithm takes the form of a decomposition coordination procedure, where the original large problem is decomposed into a number of small local subproblems that can be solved in a coordinated way [64].

For our problem, the idea is to reformulate the original problem by introducing an N × N matrix S and solve, instead,

\arg\min_{S, \alpha} \; y^T S y - \log\det(S) \quad \text{s.t.} \quad S \Big( \sum_{i=1}^{Q} \alpha_i K_i + \beta^{-1} I \Big) = I, \; \alpha \ge 0.  (55)

Then, an augmented Lagrangian function can be formulated and solved by the ADMM algorithm through iteratively updating the auxiliary matrix variable S; the kernel hyperparameters, α; and some associated dual variables. By introducing the auxiliary matrix variable S, all ADMM subproblems become convex; in particular, the weight parameters, α, are derived in closed form. According to the experimental evaluation results given in [7], this ADMM algorithm can potentially find a better local minimum with improved prediction accuracy compared with the MM algorithm, although at the cost of increased computational time in practice.

The computational complexity of one iteration of the MM algorithm is O(n² · max(n, Σ_{i=1}^{Q} r_i)), where r_i stands for the rank of K_i. The MM algorithm benefits from the low-rank property of the GSM subkernels. Let the average rank of the GSM subkernels be r̄ = (1/Q) Σ_{i=1}^{Q} r_i ≪ n; if Q r̄ > n, then the overall complexity of the MM algorithm scales as O(Q · r̄ · n²); otherwise, it scales as O(n³). Similar conclusions hold for the ADMM algorithm, too. These results show that the complexity also relies on the preselected number of subkernels, Q, which is often set to a larger value than the one actually required; however, setting this parameter adaptively and economically for different datasets remains an open challenge. In addition to the previously mentioned MM and ADMM algorithms, one could also resort to other advanced optimization algorithms to solve the problem; for instance, the successive convex approximation algorithms reviewed in [56] hold great potential.
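To make the MM iteration (53)–(54) concrete, the following minimal sketch solves the linearized subproblem directly with CVXPY, using the matrix-fractional atom in place of the SOCP reformulation described above; the function name, the substitution δ = 1/β, and the choice of solver are our own illustrative assumptions, not the exact implementation of [7].

```python
import numpy as np
import cvxpy as cp

def mm_update(K_list, y, alpha_k, delta_k):
    """One MM iteration for (52)-(54): minimize the linear majorization of
    l(eta) = y^T C(eta)^{-1} y + log det C(eta), where
    C(eta) = sum_i alpha_i K_i + delta * I with delta = 1/beta.
    Constant terms of (54) are dropped, as they do not affect the argmin."""
    n, Q = len(y), len(K_list)
    C_k = sum(a * K for a, K in zip(alpha_k, K_list)) + delta_k * np.eye(n)
    C_k_inv = np.linalg.inv(C_k)
    # Gradient of the concave log det part at eta_k -> linear term in (54)
    g_alpha = np.array([np.trace(C_k_inv @ K) for K in K_list])
    g_delta = np.trace(C_k_inv)
    alpha = cp.Variable(Q, nonneg=True)
    delta = cp.Variable(nonneg=True)
    C = sum(alpha[i] * K_list[i] for i in range(Q)) + delta * np.eye(n)
    # matrix_frac(y, C) = y^T C^{-1} y is jointly convex, so the subproblem
    # is a valid convex program (internally handled via the Schur complement)
    objective = (cp.matrix_frac(y, C)
                 + cp.sum(cp.multiply(g_alpha, alpha)) + g_delta * delta)
    cp.Problem(cp.Minimize(objective)).solve(solver=cp.SCS)
    return alpha.value, delta.value

# Usage: iterate mm_update until the change in (alpha, delta) is small.
```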
Inference algorithms for Bayesian tensor decompositions
In this section, we introduce the inference algorithm design for Bayesian tensor decompositions. Our focus is on presenting the key ideas for deriving inference algorithms for the Bayesian tensor CPD model [10], [11], [12], [13], [14], [15] via the Gaussian likelihood and the GSM prior (introduced in the "Sparsity-Aware Modeling for Tensor Decompositions" section). For other tensor decomposition formats, e.g., the Bayesian tensor TuckerD [48] and TTD [51], since they share the same prior design principle as that of the CPD, the associated inference algorithms follow a similar rationale.

In the Bayesian tensor CPD, the goal of inference is to estimate the posterior distributions of the factor matrices {A^{(p)} ∈ R^{J_p × L}}_{p=1}^{P} from possibly incomplete P-D tensor data observations Y_Ω ∈ R^{J_1 × ⋯ × J_P}, where Y_{j_1, …, j_P} is observed if the
P-tuple indices (j_1, …, j_P) belong to the set Ω. The forward problem is commonly modeled as a Gaussian likelihood:

p(\mathcal{Y}_{\Omega}; \{A^{(p)}\}_{p=1}^{P}, \beta) = \prod_{(j_1, \ldots, j_P) \in \Omega} \mathcal{N}(\mathcal{Y}_{j_1, \ldots, j_P}; [\![A^{(1)}, \ldots, A^{(P)}]\!]_{j_1, \ldots, j_P}, \beta^{-1})  (56)

where β is the precision (the inverse of the variance) of the Gaussian noise. Since it is unknown, a noninformative prior [e.g., the Jeffreys prior, p(β) ∝ 1/β] can be employed.
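For concreteness, the CP mean in (56) at the observed entries can be evaluated as in the following small helper; the function name and array layout are our own illustrative choices, not code from the cited works.

```python
import numpy as np

def cp_mean(factors, idx):
    """Mean of the Gaussian likelihood (56): the CP model
    [[A^(1), ..., A^(P)]] evaluated at the observed indices.
    factors: list of P matrices A^(p), each of shape (J_p, L);
    idx: integer array of shape (|Omega|, P), one P-tuple per row."""
    prod = np.ones((idx.shape[0], factors[0].shape[1]))
    for p, A in enumerate(factors):
        prod *= A[idx[:, p], :]      # pick row j_p of A^(p), all L columns
    return prod.sum(axis=1)          # sum over the L rank-1 terms
```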
To promote low rankness, for the lth columns of all the factor matrices, a GSM sparsity-promoting prior with latent variance variable ζ_l has been adopted; see the detailed discussions in the "Sparsity-Aware Modeling for Tensor Decompositions" section. Usually, the hyperparameters η in the adopted GSM priors are preselected to make the prior noninformative and thus need no further optimization. The unknown parameters θ include the factor matrices {A^{(p)}}_{p=1}^{P}, the latent variance variables {ζ_l}_{l=1}^{L} of the GSM priors, and the noise precision β. Under the evidence maximization framework, the inference problem can be formulated as (51) [the expression of the objective function in (51) is quite lengthy (see, e.g., [14]) and thus is not included here] with unknown parameters

\theta \triangleq \{\{A^{(p)}\}_{p=1}^{P}, \{\zeta_l\}_{l=1}^{L}, \beta\}  (57)

and the joint pdf

p(\mathcal{D}, \theta) \triangleq p(\mathcal{Y}_{\Omega}, \{\{A^{(p)}\}_{p=1}^{P}, \{\zeta_l\}_{l=1}^{L}, \beta\})  (58)

which can be computed as the product of the likelihood and the priors.

Without imposing any constraint on the pdf q(θ), the optimal solution is just the posterior, i.e., q*(θ) = p_M(θ | Y_Ω), whose computation using Bayes' theorem will, however, encounter the intractable multiple integration challenge. To get over this difficulty, modern approximate inference techniques propose to solve problem (51) by further constraining q(θ) to a functional family F; i.e., q(θ) ∈ F. It is hoped that the family F is as flexible as possible, to allow accurate posterior estimates, and, at the same time, simple enough to enable tractable optimization algorithm designs.

Among all the functional families, the mean field (MF) family is undoubtedly the most favorable one in recent Bayesian tensor research [10], [11], [12], [13], [14], [15]. It assumes that the variational pdf factorizes as q(θ) = ∏_{k=1}^{K} q(θ_k), where θ is partitioned into mutually disjoint nonempty subsets θ_k (i.e., ∪_{k=1}^{K} θ_k = θ and ∩_{k=1}^{K} θ_k = ∅). In the context of the Bayesian tensor CPD, the MF assumption states that

q(\theta) = \prod_{p=1}^{P} q(A^{(p)}) \, q(\{\zeta_l\}_{l=1}^{L}) \, q(\beta).  (59)

The factorized structure in (59) inspires the idea of block minimization in optimization theory. In particular, for the ELBO maximization problem (51), after fixing the variational pdfs {q(θ_j)}_{j≠k}, the resulting subproblem that optimizes q(θ_k) has been shown to have the following optimal solution (see, e.g., [1]):

q^{\star}(\theta_k) = \frac{\exp\big(\mathbb{E}_{\prod_{j \neq k} q(\theta_j)}[\ln p(\mathcal{D}, \theta)]\big)}{\int \exp\big(\mathbb{E}_{\prod_{j \neq k} q(\theta_j)}[\ln p(\mathcal{D}, \theta)]\big) \, d\theta_k}  (60)

where E_{q(·)}[·] denotes the expectation with respect to the variational pdf q(·). The inference framework under the MF assumption is termed MF-VI.

Whether the integration in the denominator of (60) has a closed form is determined by the functional forms of the likelihood and the priors. In particular, if they are conjugate pairs within the exponential family of probability distributions (see the discussions in, e.g., [1] and [19]), the optimal variational pdf in (60) will accept a closed-form expression. Fortunately, for the Bayesian tensor CPD adopting the Gaussian likelihood and the GSM prior for the columns of the factor matrices, this condition is usually satisfied, which enables the derivation of closed-form updates in recent advances [10], [11], [12], [13], [14], [15].
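The following toy sketch illustrates how (60) plays out when conjugacy holds: for scalar data x_i ~ N(μ, τ^{-1}) with a normal-gamma prior, each q update in the MF-VI cycle is available in closed form. This is a deliberately simple stand-in for the (much lengthier) CPD updates; all names and default values are our own.

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1e-6, b0=1e-6, n_iters=50):
    """Toy CAVI illustrating (60): closed-form q updates under conjugacy for
    x_i ~ N(mu, tau^{-1}) with a normal-gamma prior and q(mu, tau) = q(mu)q(tau)."""
    N, xbar = len(x), np.mean(x)
    E_tau = a0 / b0                                # initial guess for E[tau]
    for _ in range(n_iters):
        # q(mu) = N(mu_N, 1/lam_N): its optimal form follows from (60)
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # q(tau) = Gamma(a_N, b_N): again from (60), using current q(mu) moments
        a_N = a0 + 0.5 * (N + 1)
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N  # E_q(mu)[sum_i (x_i - mu)^2]
        b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N
```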
Remark 4
To facilitate the algorithm derivation, the MF-VI imposes a factorization structure on q(θ), which implies the statistical independence of the variables θ_k, given the observed dataset D. If this is not the case, the MF approximation will lead to a mismatch when approaching the ground truth posteriors. In general, the MF-VI tends to provide posterior approximations that are more compact than the true ones, which means that the posterior estimates are usually "overconfident" [19]. To achieve more accurate posterior estimation, there is a research trend to employ more advanced variational approximation techniques than the MF approximation. For example, recent tensor-aided Bayesian DNN research [49] utilizes the kernelized Stein discrepancy to derive an inference algorithm that can approximate the posterior better. The interested reader may refer to [59] for some recent advances in variational approximation methods.

Some computational and theoretical difficulties that are commonly encountered in Bayesian tensor decompositions are summarized as follows. First, due to the block coordinate descent nature of MF-VI [59], it is crucial to choose informative initial values to avoid poor local minima. On the other hand, the associated computational complexity is cubic with respect to the tensor rank (see, e.g., [10], [11], [14], and [51]), which is high if the initial tensor rank is set to a large value. Finally, it was found that the algorithm performance significantly degrades in some challenging regimes, e.g., a low signal-to-noise ratio (SNR) and/or a high rank (see, e.g., [10], [11], [14], and [51]). To overcome these difficulties, suggestions for real algorithm implementations are provided in the "Unsupervised Learning via Bayesian Tensor Decompositions" section.

Inference algorithms for Bayesian DNNs
The step of inference (training) for Bayesian DNNs follows the same backpropagation type of philosophy as that of training their deterministic counterparts. There are, however, two notable
differences. First, the unknown (synaptic) parameters/weights are now described via parameterized distributions. Thus, the cost function to be optimized has to be expressed in terms of the hyperparameters that define the respective distributions, instead of the weights/synapses. This involves the so-called reparameterization step, which we describe later in this section. Second, the evidence function to be maximized is not of a tractable form, and it has to be approximated by its ELBO [see the definition in (49)]. In this section, we outline the basic steps that are followed for VI in the case of a Bayesian deep network that includes layers with 1) stochastic LWTA blocks, 2) stochastic synaptic weights of Gaussian form, and 3) a sparsity-inducing mechanism imposed over the network synapses that is driven via an IBP prior. We discussed this type of network in the "Sparsity-Aware Modeling Using IBP Prior" section; also see Figure 10. To facilitate understanding, we provide a graphical illustration of the considered stochastic LWTA block in Figure 13.

Without harming generality, let us focus on a specific layer, say, the (f + 1)th one. To slightly unclutter the notation, assume that for this layer, the input dimension a_f = L and the number of nodes (LWTA units) a_{f+1} = K. Thus, the corresponding input matrix to the layer becomes X ∈ R^{N × L}, with N samples, each comprising L features. Under the stochastic LWTA-based modeling rationale, nodes (neurons) are replaced by LWTA blocks, each containing a set of J competing linear units. Thus, the layer input is now presented to each different block and to each unit therein via different weights; that is, the weights for this layer are now represented via a 3D matrix W ∈ R^{L × K × J} (we again suppress the dependence on the layer index, f, in our notation). Recall from the "Sparsity-Aware Modeling Using IBP Prior" section that each layer is associated with a latent discrete random vector, ξ_n ∈ one_hot(J)^K, that encodes the outcome of the local competition among the units in all K LWTA blocks of a network layer when the nth input sample is presented. Furthermore, recall that each link connecting an input dimension of the nth sample, e.g., x_ni, to an LWTA block, e.g., the kth one, is weighted by a utility binary random variable, z_ik. This is set equal to one if the ith input dimension is presented to the kth LWTA block; otherwise, z_ik = 0. We impose the sparsity-inducing IBP prior over these utility hidden variables.

We are now ready to write the output of a specific layer of the considered model, i.e., y_n ∈ R^{K · J}, as follows:

y_{nkj} = \xi_{nkj} \sum_{i=1}^{L} (w_{ikj} z_{ik}) x_{ni} \in \mathbb{R}  (61)

where x_n is the L-dimensional input that coincides with the output of the previous layer. The involved random variables, whose posterior distributions are to be learned during training, are 1) the synaptic weights, w_ikj, i = 1, 2, …, L, k = 1, 2, …, K, and j = 1, 2, …, J, for all layers; 2) the utility variables, z_ik, for all layers; and 3) the indicator vectors, ξ_nk, for the nth sample and the kth LWTA block, for all layers. The functional forms of the respective distributions are as follows:
■ Synaptic weights:

Prior: p(w_ikj) ~ N(w_ikj; 0, 1)
Posterior: q(w_ikj) ~ N(w_ikj; μ_ikj, ζ_ikj)

where the mean and variance, μ_ikj and ζ_ikj, respectively, are learned during training.
■ Utility binary random variables:

Prior: p(z_ik) = Bernoulli(z_ik; π_ik)
Posterior: q(z_ik) = Bernoulli(z_ik; π̃_ik)

where the π_ik come from the IBP prior (see the "SAL via Bayesian Methods for Nonparametric Models" section) and the π̃_ik are learned during training. The use of the Bernoulli distribution is imposed by the binary nature of the variable.
■ Indicator random vectors, ξ_nk:

Prior: p(ξ_k) = Categorical(ξ_k | 1/J, …, 1/J), i.e., all linear units are equiprobable
Posterior: q(ξ_nk) = Categorical(ξ_nk | P_nk1, …, P_nkJ)

where P_nkj is defined via the softmax operation, e.g., (36). The categorical distribution is imposed because only one out of the J elements of ξ_nk is equal to one, and the rest are zeros. The jth element becomes one with probability P_nkj.
[Figure 13 shows the kth block of a stochastic LWTA layer: each input x_i, i = 1, …, L, is multiplied by a weight w_ikj ~ N(μ_ikj, ζ_ikj) for every competing unit j, the weighted inputs are summed, and the block output is y_kj = ξ_kj Σ_{i=1}^{L} w_ikj x_i, with ξ_kj = 1 for j = j_0 and ξ_kj = 0 for j ≠ j_0.]

FIGURE 13. The kth block of a stochastic LWTA layer. Input x = [x_1, x_2, …, x_L] is presented to each unit, j = 1, 2, …, J, in the block. Assume that the index of the winner unit is j = j_0. Then, the output of the block is a vector with a single nonzero value at index j_0.
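To tie (61) and the preceding distributions together, here is a minimal NumPy sketch of a stochastic LWTA layer at training time; the hard categorical sampling stands in for the relaxations used in practice, and the function name, shapes, and temperature value are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def lwta_layer_forward(X, mu, zeta, pi_tilde, tau=0.67):
    """Sketch of one stochastic LWTA layer, eq. (61), at training time.
    X: (N, L) inputs; mu, zeta: (L, K, J) posterior means/variances of the
    weights; pi_tilde: (L, K) posterior utility probabilities; tau is a
    softmax temperature (hypothetical value). Returns (N, K, J) outputs."""
    N, L = X.shape
    _, K, J = mu.shape
    # (63): reparameterized Gaussian weight samples
    w = mu + np.sqrt(zeta) * rng.standard_normal(mu.shape)
    # Utility variables z_ik ~ Bernoulli(pi_tilde), sampled directly here
    z = (rng.random((L, K)) < pi_tilde).astype(float)
    # Pre-activations h_nkj = sum_i (w_ikj z_ik) x_ni
    h = np.einsum('nl,lkj->nkj', X, w * z[:, :, None])
    # Competition: xi_nk ~ Categorical with P_nkj given by a softmax over J
    logits = h - h.max(axis=-1, keepdims=True)
    p = np.exp(logits / tau)
    p /= p.sum(axis=-1, keepdims=True)
    winners = np.array([[rng.choice(J, p=p[n, k]) for k in range(K)]
                        for n in range(N)])
    xi = np.eye(J)[winners]          # one-hot over J, shape (N, K, J)
    return xi * h                    # eq. (61): single nonzero unit per block
```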
Note that besides the previous random variables, which are directly related to the DNN architecture, there is another set of hidden random variables, i.e., the u_j, that are used for generating the IBP prior. These also have to be considered as part of the palette of the involved random variables. As stated in the "Indian Buffet Process Prior" section, these follow the beta distribution, with prior beta(u_j; α, 1) and posteriors beta(u_j; a_j, b_j), where a_j and b_j are learned during training. To train the proposed model, we resort to the maximization of the ELBO. The trainable model parameters, in our case, are the set of all the weights' posterior means and variances (i.e., μ_ikj and ζ_ikj); the synaptic utility indicator posterior probabilities, π̃_ik; and the stick variable posterior parameters, a_j and b_j, across all network blocks and layers. Let us refer to this set as H. We assume that our task includes C classes and that the softmax nonlinearity is used in the output layer.

In the following, we denote by D the input–output training dataset. In addition, let Z be the set of the synaptic utility indicators across the network layers, N be the set of winner unit indicators across all blocks of all layers, W be the set of synapse weights across all layers, and U be the set of the stick variables of the sparsity-inducing priors imposed across the network layers. Employing the MF approximation on the joint posterior pdf, i.e., factorizing q(W, Z, N, U), it is readily shown, e.g., [2], that

\text{ELBO}(H) = \mathbb{E}_q\Big[\sum_{n=1}^{N}\sum_{c=1}^{C} y_{nc} \ln \tilde{y}_{nc}(x_n; W, Z, N, U)\Big] + \underbrace{\mathbb{E}_q \ln \frac{p(Z \mid U)}{q(Z)} + \mathbb{E}_q \ln \frac{p(U)}{q(U)}}_{\text{regularizing terms}} + \underbrace{\mathbb{E}_q \ln \frac{p(N)}{q(N \mid Z, W)} + \mathbb{E}_q \ln \frac{p(W)}{q(W)}}_{\text{regularizing terms}}  (62)

where the y_nc are the outputs in the training set and the ỹ_nc, c = 1, 2, …, C, are the class probability outputs as estimated via the softmax nonlinearities that implement the output layer. Observe that the first term on the right-hand side is the negative expectation of the cross entropy of the network. The only difference with the deterministic DNNs that use this function to optimize the network is that now the expectation over the posterior is involved. In practice, it turns out that drawing one sample from the involved distributions suffices to lead to a good approximation, provided that this sample is written as a differentiable function of H and some low-variance random variable, e.g., [65] and [66]. The rest of the terms in (62) are KL divergences that bias the posteriors to be as close as possible to the corresponding priors. In other words, they act as regularizers that bias the solution toward certain regions in the parameter space, as dictated by the adopted priors. For example, the last term biases the posterior of the synaptic weights toward the standard Gaussian and bears close similarities with the ℓ2 regularization when the latter is enforced on the synaptic weights.

Drawing samples to approximate the expectations in (62) is closely related to what we previously called reparameterization. Recall that, in the framework of the backpropagation algorithm, during the forward pass, one needs specific values/samples of the involved random parameters to compute the outputs, given the input to the network. This is performed via sampling the respective distributions, based on the current estimates of the involved posteriors. However, to be able to optimize with respect to their defining hyperparameters, the corresponding current estimates should be explicitly considered. Let us take the Gaussian synaptic weights as an example. Let N(w_ikj; μ_ikj, ζ_ikj) be the current estimate of some weight in some layer. Instead of sampling from this Gaussian directly, it is easy to see that it is equivalent to obtaining a corresponding sample of the weight as

\tilde{w}_{ikj} = \mu_{ikj} + \zeta_{ikj}^{1/2} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1).  (63)

In this way, every link in the network is determined explicitly by the pair (μ_ikj, ζ_ikj), and the backpropagation optimizes with respect to the means and variances. The reparameterization of the rest of the involved random variables follows a similar rationale, yet the involved formulae are slightly more complex; however, they are still given in terms of analytic expressions. For example, for the utility variables, z_ik, reparameterization is achieved via the posteriors π̃_ik and the so-called Gumbel–Softmax relaxation, e.g., [67]. For the stick-breaking variables, the reparameterization is achieved via the so-called Kumaraswamy approximation [68]. Details and the exact formulae can be found in [2]. Once samples have been drawn for all the involved variables, the ELBO in (62) is expressed directly in terms of the drawn samples, i.e., W̃, Z̃, Ñ, and Ũ, without any expectation being involved. The trainable parameter set can then be obtained by means of any off-the-shelf gradient-based optimizer, such as Adam [53]. Note that, by adopting the reported reparameterizations, one obtains low-variance stochastic gradients that ensure convergence.
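As one example of these reparameterizations, a minimal Gumbel–Softmax sampler (see [67]) might look as follows; the temperature value is an illustrative assumption.

```python
import numpy as np

def gumbel_softmax(logits, tau=0.67, rng=np.random.default_rng(0)):
    """Differentiable relaxation of categorical sampling, used to
    reparameterize the discrete indicator/utility variables."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))   # numerically stable softmax
    return y / y.sum(axis=-1, keepdims=True)
```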
Training a fully Bayesian model slightly increases the
where y nc are the outputs in the training set and yu nc, c = required complexity since more parameters are involved; e.g.,
1, 2, f, C, are the class probability outputs as estimated via the instead of a single weight, one has to train with respect to the
softmax nonlinearities that implement the output layer. Observe respective mean value and variance as well as the hidden util-
that the first term on the right-hand side is the negative expecta- ity variables. However, the training timing remains of the same
tion of the cross entropy of the network. The only difference order as that required by the deterministic versions. Once train-
with the deterministic DNNs that use this function to optimize ing has been completed, during testing, given an input, the fol-
the network is that now the expectation over the posterior is in- lowing apply:
volved. In practice, it turns out that drawing one sample from 1) One can use the mean values of the obtained posterior
the involved distributions suffices to lead to good approxima- Gaussians, n ijk, in place of the synaptic weights, w ijk .

Sampling from the distribution could also be another possibility; usually, the mean values are used.
2) One can employ a threshold value, e.g., τ, and remove all links where the corresponding posterior π̃ is below this threshold (a minimal sketch of options 1 and 2 is given after this list). Sampling is also another alternative.
3) One can sample from the respective categorical distributions to determine which linear unit "fires" in each block. Selecting the one with the largest probability is another alternative.
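Concretely, options 1) and 2) could be combined as in the following sketch; the threshold value and array shapes are our own illustrative assumptions.

```python
import numpy as np

def compress_at_test_time(mu, pi_tilde, tau=0.5):
    """Replace weights by their posterior means (option 1) and prune the
    links whose utility posterior pi_tilde falls below tau (option 2).
    mu: (L, K, J) posterior means; pi_tilde: (L, K) utility probabilities."""
    mask = (pi_tilde > tau).astype(mu.dtype)
    return mu * mask[:, :, None]      # pruned, deterministic weights
```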
Remark 5
At this point, we must stress that the learned posterior variances over the network weights can also be used for reducing the floating-point bit precision required for storing them; this effectively results in a memory footprint reduction. The main rationale behind this process is the fact that the higher the posterior weight variance, the more the weights fluctuate under sampling at inference time. This implies that, eventually, some bits fluctuate too much under sampling, and therefore, storing and retaining their values does not contribute to inference accuracy. Thus, Bayesian methods offer this added luxury to optimally control the required bit precision for individual nodes. For example, in [2], it is reported that, for the case of LeNet 300-100 trained on the Modified National Institute of Standards and Technology database, the bit precision can be reduced from a 23-bit mantissa to just 2 bits. For more details, see, e.g., [2].

Remark 6
In [2], it was shown that the resulting architectures are able to yield a significant reduction in their computational footprint while retaining, nevertheless, state-of-the-art performance; positively, the flexibility of the link-wise IBP allowed for a more potent sparsity- and activation-aware blend based on stochastic LWTA, offering significant benefits compared to the conventional nonlinearities introduced by activation functions such as the sigmoid and ReLU. For example, in the case of the Canadian Institute for Advanced Research (CIFAR)-10 VGG-like network, it turns out that only around 5% of the original nodes are retained, without affecting the performance of the network, even though quantized arithmetic was also employed to reduce the required number of bits, as explained in the previous remark; see, e.g., [2].

Applications in signal processing and machine learning
In this section, we showcase typical applications of the sparsity-promoting data analysis tools introduced in the "The Art of Prior: Sparsity-Aware Modeling for Three Case Studies" section. More specifically, advanced time series prediction using GP models is demonstrated in the "Time Series Prediction via GPs" section, adversarial learning using Bayesian DNNs is presented in the "Adversarial Learning via Bayesian DNNs" section, and, finally, social group clustering and image completion using unsupervised tensor decompositions are demonstrated in the "Unsupervised Learning via Bayesian Tensor Decompositions" section.

Time series prediction via GPs
In the following, we present an important signal processing and machine learning application, namely, time series prediction, using nonparametric GPs. We focus on the GP regression models with the family of sparse spectrum kernels introduced in the "Sparsity-Aware Modeling for GPs" section. To demonstrate the advantages of the sparsity-promoting GP models over other competing counterparts, we selected a number of classic time series datasets (these datasets are available from the University College London repository), such as CO2, Electricity, and Unemployment, as well as a "fresh" real-world 5G wireless traffic dataset in our tests. Data descriptions are given in Table 2.

Classic datasets
In the following, we compare the performance of the sparsity-promoting GP models, using the original SM kernels [36], [37] and the modified GridSM kernel [7], with that of a classic deep learning-based time series prediction model, namely, long short-term memory (LSTM) [69], as well as a canonical statistical model, namely, the autoregressive integrated moving average (ARIMA) model [70], from various different aspects. (The ARIMA model can be regarded as a special case of a GP with a specific sparse kernel matrix.) Furthermore, we compare the GP models with the recently proposed Transformer-based time series prediction model, called Informer, which successfully addressed the computation issues and some inherent limitations of the encoder–decoder architecture in the original Transformer model.

Table 2. The descriptions of the selected datasets.

Name Data Description Training D Test D )


ECG Electrocardiography of an ordinary person, measured over a period of time 680 20
Carbon dioxide Carbon dioxide concentration observed from 1958 to 2003 481 20
Electricity Monthly average residential electricity usage in Iowa City, Iowa, USA, from 1971 to 1979 86 20
Employment Wisconsin, USA, employment status observed from January 1961 to October 1975 158 20
Hotel Monthly occupied hotel rooms, collected from 1963 to 1976 148 20
Passenger Passenger miles flown domestically in the United Kingdom from July 1962 to May 1972 98 20
Clay Monthly production of clay bricks from January 1956 to August 1995 450 20
Unemployment Monthly U.S. female (16–19 years old) unemployment figures from 1948 to 1981 380 20
5G wireless traffic Downlink data usage in a small cell observed in four weeks of 2021 607 67

ECG: electrocardiogram.
The training data, D, are used for optimizing the hyperparameters of the learning model, while the test data, D ), are used for evaluating the prediction accuracy. The num-
bers given in the last two columns are the training sample size and test sample size, respectively.

For brevity, we name the first sparse spectrum GP model (equivalent to using trigonometric basis functions) proposed in [36] SSGP, the GP model using the original SM kernel [37] SMGP, and the most recent one, with the rectified GridSM kernel [7], GSMGP. Their configurations can be found in detail in [7].

Table 3 shows the obtained prediction accuracy of the various methods, quantified in terms of the prediction MSE. It is readily observed that the sparsity-promoting GP models, and in particular, the SMGP and GSMGP, outperform all other competitors by far. The following facts need to be mentioned. For both the classic LSTM and Informer models to achieve good performance, the time series should, in general, be long, so that the underlying pattern can be learned during the training phase. The ARIMA model strongly relies on the optimal configuration of the parameters (p, d, and q), and it is incapable of long-term prediction. In contrast, the sparsity-promoting GP models can automatically fit the underlying data pattern by solving for the hyperparameters via maximizing the evidence function. We have also shown in the supplement of [7] that the GSMGP is superior to the GP models with an elementary kernel (such as the SE kernel, the rational quadratic kernel, and so on) as well as to those with a hybrid of these.

Besides the improved prediction performance, the training time of the GSMGP outperforms that of the original SMGP by far (here, training time refers to the computational time required for training the learning models introduced in the "The Art of Inference: Evidence Maximization and Variational Approximation" section). For the selected datasets, the SMGP requires training time on the magnitude of 10^3 s, while the GSMGP requires only 10^2 s. By reducing the number of Gaussian modes, Q, the SMGP is able to reduce its training time, albeit at the cost of sacrificing the fitting performance. On the other hand, the GSMGP improves its training time by fixing the frequency and variance parameters of the original SM kernel to known grids, so that the evidence maximization task enjoys the favorable difference-of-convex structure that can be efficiently handled by the MM algorithm introduced in the "Inference Algorithms for GP Regression" section. In addition to the reduced training time, the overall optimization performance (including the convergence speed, the chance of being trapped in a bad local minimum, and so on) and the sparsity level of the solution have been significantly improved. Detailed comparisons and pictorial illustrations can be found in [7]. In comparison, the LSTM and ARIMA models require the least training time, on the magnitude of 10^1 s, on average. However, due to the huge architecture adopted in the Informer model, its computational time is on the magnitude of 10^2 s, on average. As readily observed from the results, the sparsity-promoting property helps reduce the computational time significantly; more importantly, the sparse solution identifies the most effective frequency components of the data and thus leads to good model interpretability.

Real 5G dataset
Next, we focus on another favorable advantage of the GP models over their deep learning counterparts, namely, the natural uncertainty region of a point prediction. We specifically select the real 5G wireless traffic dataset for visualization purposes, due to the high demand of such a wireless application for a reasonable prediction uncertainty [25], [72]. The dataset was collected in a small cell of a southern city in China, and it contains the downlink data volume consumed by the mobile users located in the cell within each hour during a period of four weeks. Accurate prediction of the future downlink data consumption is vital for the operators to tune the transmit power of the base station and switch it on/off automatically. In this application, uncertainty information is even more crucial because wrongly reducing the transmit power may largely influence the mobile users' surfing experience.

For this 5G wireless traffic prediction example, we constrained ourselves to a GSMGP with Q = 500; an SMGP with Q = 500; a standard GP with a hybrid of three elementary kernels (two periodic kernels plus an SE kernel), as used in [25]; and a classic LSTM deep learning model, as described previously. We used the data collected in the first 607 samples for training the models and the last 67 samples to test their prediction performance. For comparing their prediction accuracy, we chose the mean absolute percentage error (MAPE) measure, which is commonly used for evaluating the wireless traffic prediction error. The MAPE averaged over multiple test points is given by

e_{\text{MAPE}} = \frac{1}{n^{\star}} \sum_{i=1}^{n^{\star}} \frac{|y_i - \hat{y}_i|}{y_i} \times 100\%.  (64)
variance parameters of the original SM kernel to known grids
so that the evidence maximization task enjoys the favorable The prediction performance of the preceding learning mod-
difference-of-convex structure that can be efficiently handled by els is detailed in Figure 14. It is readily seen that the GSMGP
the MM algorithm introduced in the “Inference Algorithms for gives the best point prediction in terms of the MAPE. More-
GP Regression” section. In addition to the reduced training time, over, as we mentioned before, the focus of this example is
the overall optimization performance
(including the convergence speed,
chance of being trapped in a bad local Table 3. The comparisons of various time series prediction models in terms of the prediction MSE.
minimum, and so on) and the sparsity
level of the solution have been signifi- Name GSMGP SSGP SMGP LSTM Informer ARIMA
cantly improved. Detailed comparisons MSE MSE MSE MSE MSE MSE
and pictorial illustrations can be found ECG 1.3E - 02 1.6E - 01 1.9E - 02 2.1E - 02 5.4E - 02 1.8E - 01
CO2 1.5E + 0 2.0E + 02 1.1E + 0 2.1E + 0 8.4E + 01 4.9E + 0
in [7]. In comparison, the LSTM and
Electricity 4.7E + 03 8.2E + 03 7.5E + 03 4.7E + 03 8.3E + 03 1.2E + 04
ARIMA models require the least train- Employment 1.1E + 02 7.7E + 01 0.7E + 02 4.3E + 02 2.0E + 03 3.9E + 02
ing time, on the magnitude of 101 s, on Hotel 8.9E + 02 1.9E + 04 2.8E + 03 7.8E + 03 2.3E + 04 1.7E + 04
average. However, due to the huge archi- Passenger 1.9E + 02 6.9E + 02 1.6E + 02 1.6E + 02 1.2E + 02 4.5E + 03
tecture adopted in the Informer model, Clay 1.9E + 02 5.3E + 02 3.3E + 02 2.7E + 02 1.4E + 02 3.3E + 02
the computational time is on the mag- Unemployment 3.6E + 03 2.1E + 04 1.4E + 04 3.5E + 03 3.8E + 03 1.5E + 04
nitude of 102 s, on average. As readily
Herein, we let both the GSMGP and the SMGP employ Q = 500 Gaussian mixture modes in their kernels, and the
observed from the results, the sparsity- SSGP employs the same number of trigonometric basis functions. The GSMGP samples Q normalized frequency
promoting property helps reduce the parameters, μi, i = 1, 2, g, Q, uniformly from [0, 1/2) while setting the variance parameter to v = 0.001. The
LSTM model follows a standard setup with one hidden layer, and the dimension of the hidden state is set to 30. The
computational time significantly; more Informer model follows the default setup given in the original paper [71]. The ARIMA (p, d, q) model is a standard
importantly, the sparse solution identi- one, with p = 5, d = 1, and q = 2.

primarily on the uncertainty quantification. For GPs, the desired uncertainty region can be obtained naturally by computing the posterior variances associated with the test samples. In contrast, the classic LSTM model can provide only point predictions, without any uncertainty quantification. A recent technique using so-called deep ensembles [73] can be applied to quantify the predictive uncertainty of the LSTM model. The common characteristic of such techniques lies in that one has to train the models multiple times, using different configurations (such as different initial guesses, step sizes, and so on). However, such an approach increases the computational load substantially compared to the Bayesian approach, especially when complex (deterministic) learning models are involved. When comparing the uncertainty regions of the GP models, we can observe that the one using a mixture of elementary kernels tends to be conservative and shows the largest uncertainty region. In contrast, both the SMGP and GSMGP provide rather accurate point predictions as well as smaller uncertainty levels. It is noteworthy that the SMGP presents a less accurate point prediction (using its posterior mean) compared to that of the GSMGP, while its uncertainty level is modestly larger than that of its counterpart. This suggests that, in this case, the SMGP is less favorable, because wrong decisions about switching on/off the base station are more likely to be made.

The preceding fitting results clearly demonstrate the advantages of the sparse spectrum kernel-based GP models; however, the obtained performance depends on the quality of the initialization. In particular, the method is sensitive to the initial guess of the SM kernel. According to our experience, a reliable initial guess can be obtained by fitting a periodogram (namely, a nonparametric approximation of the true spectral density) in the frequency domain. We could also combine this strategy with the bootstrap technique [74] to generate a number of candidate initial guesses for avoiding bad local minima. Codes for implementing the GSM kernel-based GP model are available at https://github.com/Paalis/MATLAB_GSM.
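A possible form of such a periodogram-based initialization is sketched below; the function name and the simple peak-picking rule are our own illustrative choices, not the exact procedure of [7].

```python
import numpy as np

def init_sm_frequencies(y, Q, fs=1.0):
    """Heuristic initial guess for the SM kernel frequencies: pick the Q
    strongest bins of the periodogram, a nonparametric spectral estimate."""
    n = len(y)
    spectrum = np.abs(np.fft.rfft(y - np.mean(y))) ** 2 / n   # periodogram
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    strongest = np.argsort(spectrum)[::-1][:Q]                # Q largest peaks
    return np.sort(freqs[strongest])
```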

[Figure 14 comprises four panels, each plotting downlink data usage versus time and showing the training data, test data, point predictions, and, for the GP models, the gray uncertainty region.]

FIGURE 14. A comparison of 5G wireless traffic prediction performance obtained from different models. (a) A GSMGP model with Q = 500 fixed grids, whose e_MAPE = 0.28. (b) An SMGP model with Q = 500 modes, whose e_MAPE = 0.42. (c) A standard GP with a hybrid of three elementary kernels, whose e_MAPE = 0.3. (d) An LSTM model, whose e_MAPE = 1.12. The gray shaded areas represent the uncertainty region (computed from the posterior variances) of the GP models.

Adversarial learning via Bayesian DNNs
Despite the widespread success of DNNs, recent investigations have revealed their high susceptibility to adversarial examples, that is, cleverly crafted examples whose sole purpose is that of

fooling a considered model into misclassification. Adversarial examples can be constructed using various approaches, e.g., the fast gradient sign method [75] and the Carlini and Wagner method [76]. A popular and powerful attack is the projected gradient descent (PGD) attack [77]. Under this scheme, the adversary is assumed to have access to the objective function of the target model, L(w, x, y), where w denotes the model trainable parameters, x the input, and y the predicted output variables. On this basis, the adversary performs an iterative computation; at each iteration, t, the adversary computes an (e.g., ℓ∞-bounded) adversarial perturbation of the training set examples x, based on a multistep PGD procedure that reads

x^{t+1} = \Pi_{x + \mathcal{S}} \big( x^{t} + \alpha \, \text{sgn}(\nabla_x L(w, x, y)) \big)  (65)

where S is the set of allowed perturbations, that is, the manipulative power of the adversary, e.g., the ℓ∞ ball around x, and sgn(·) denotes the sign function that extracts the sign of a real number. In this context, even some minor (and, many times, imperceptible) modifications can successfully "attack" the model, resulting in severe performance degradation. This frailness of DNNs casts serious doubt on their deployment in safety-critical applications, e.g., autonomous driving [78].
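For reference, an ℓ∞-bounded PGD attack along the lines of (65) can be sketched in a few lines of PyTorch; the model and loss are assumed to be standard callables, and the default values simply mirror those used later in this section.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, alpha=0.007, steps=20):
    """Multistep l_inf PGD, as in (65): gradient-sign ascent followed by
    projection onto the eps-ball around the clean input x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # projection step
        x_adv = x_adv.clamp(0.0, 1.0)              # keep a valid pixel range
    return x_adv.detach()
```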
Drawing upon this vulnerability, significant research effort has recently been devoted to more reliable and robust DNNs. On this basis, several adversarial attacks and defenses have been proposed in the literature, e.g., adversarial training [77], [79], [80]. Among these lies the stochastic modeling rationale; its main operating principle is founded upon the introduction of stochasticity into the considered architecture, e.g., by randomizing the input data and/or the learning model itself [81], [82], [83]. Clearly, the Bayesian reasoning, which treats parameters as random entities instead of deterministic values, seeking to infer an appropriate generative process, seems to offer a natural stochastic defense framework toward more adversarially robust networks. It must be emphasized that the Bayesian techniques differ from the more standard randomized ones, which simply rely on the randomization of deterministic variables in the context of standard deterministic NNs. Such techniques can be fairly easily handled and attacked. In contrast, in the Bayesian framework, the whole modeling and learning are built upon statistical arguments, and the training involves learning of distributions.

Therefore, in the following, we focus on a recent application of the Bayesian rationale toward adversarial robustness. Specifically, we present the novel Bayesian deep network design paradigm proposed in [3] that yields state-of-the-art performance against powerful gradient-based adversarial attacks, e.g., PGD [77]. The key aspect of this method is its doubly stochastic nature, stemming from two separate sampling processes relying on Bayesian arguments: 1) the sparsity-inducing nonparametric link-wise IBP prior introduced in the "Sparsity-Aware Modeling Using IBP Prior" section and 2) a stochastic adaptation of the biologically inspired and competition-based LWTA activation, as discussed in the "Sparsity-Aware Modeling Using IBP Prior" and "Inference Algorithms for Bayesian DNNs" sections.

We investigate the potency of LWTA-based networks against adversarial attacks under an adversarial training regime; we employ a PGD adversary [77]. To this end, we use the well-known WideResNet-34 [84] architecture, considering three different widen factors: one, five, and 10; note, from the definition of the WideResNet-34, that the larger the widen factor, the larger the network. We focus on the CIFAR-10 dataset and adopt experimental settings similar to [85]. We use a batch size of 128 and an initial learning rate of 0.1; we halve the learning rate at every epoch after the 75th epoch. We use a single sample for prediction. All experiments are performed using a single Nvidia Quadro P6000.

For evaluating the robustness of this structure, we initially consider the conventional PGD attack with 20 steps, a step size of 0.007, and ε = 8/255, which are the parameters required by the PGD. In Table 4, we compare the robustness of the LWTA-based WideResNet networks against the baseline results of [85]. As we observe, the stochastic LWTA-based networks yield significant improvements in robustness under a traditional PGD attack; they retain extremely high natural accuracy (up to ≈13% better), while exhibiting a staggering (up to ≈32.6%) difference in robust accuracy compared to the exact same architectures employing the conventional ReLU-based nonlinearities and trained in the same fashion. Natural accuracy refers to the performance on nonadversarial examples, while robust accuracy refers to the case where the network is tested against adversarial examples.

Table 4. The natural and robust accuracy under a conventional PGD attack.

             | Natural Accuracy (%)       | Robust Accuracy (%)
Widen Factor | Baseline | Stochastic LWTA | Baseline | Stochastic LWTA
1            | 74.04    | 87              | 49.24    | 81.87
5            | 83.95    | 91.88           | 54.36    | 83.4
10           | 85.41    | 92.26           | 55.78    | 84.3

The attack has 20 steps and a 0.007 step size, using WideResNet-34 models with different widen factors. We use the same PGD-based adversarial training scheme for all models [77].

Further, to ensure that this approach does not suffer from the well-known obfuscated gradient problem [89], stronger parameter-free attacks were adopted using the newly introduced AutoAttack (AA) framework [90]. AA consists of an ensemble of four powerful white-box and black-box attacks, e.g., the commonly employed AutoPGD (APGD) attack; this is a step-free variant of the standard PGD attack [77], which avoids the complexity and ambiguity of step size selection. In addition, for the entailed ℓ∞ attack, the common ε = 8/255 value was used. Thus, in Table 5, we compare the LWTA-based networks to several recent state-of-the-art approaches evaluated on AA (https://github.com/fra31/auto-attack). The reported accuracies correspond to the final reported robust accuracy of the methods after sequentially performing all the considered AA attacks. Once again, we observe that these stochastic and sparse networks yield state-of-the-art robustness against all state-of-the-art

methods, with an improvement of ≈16.72%, even when compared with methods that employ substantial data augmentation to increase robustness, e.g., [88]. More results are reported in [3] and [4]. All results vouch for the potency of stochastic LWTA networks in adversarial settings.

Finally, since the newly proposed networks consist of stochastic components, i.e., the competitive random sampling procedure to determine the winner in each LWTA block, the output of the classifier might change at each iteration; this obstructs the attacker from altering the final decision. To counter such randomness in the involved computations, in [90], the APGD attack is combined with an averaging procedure over 20 computations of the gradient at the same point. This technique is known as expectation over transformation (EoT) [89]. Thus, AA was used jointly with EoT for further performance evaluation of the LWTA-based networks. The corresponding results are presented in Table 6. As we observe, all the considered networks retain state-of-the-art robustness against the powerful AA and EoT attacks. This conspicuously supports the usefulness of the stochastic LWTA activations toward adversarial robustness. Further explanations on why this performance is obtained are provided in, e.g., [2].

Table 5. The robust accuracy (%) comparison under the AA framework.

Method                                   | AA
Hypersphere embedding [86]               | 53.74
Width adjusted regularization [85]       | 54.73
Pretraining [87]†                        | 54.92
Data augmentation [88]†                  | 65.88
Width adjusted regularization [85]†      | 61.84
Stochastic LWTA/PGD/WideResNet-34-1 [4]  | 74.71
Stochastic LWTA/PGD/WideResNet-34-5 [4]  | 81.22
Stochastic LWTA/PGD/WideResNet-34-10 [4] | 82.6

†Models are trained with additional unlabeled data. The AA performance corresponds to the final robust accuracy after employing all the attacks in AA. Results are directly from the AA leaderboard.

Table 6. The robustness against AA, combined with 20 iterations of the EoT.

Widen Factor | Natural Accuracy | APGD  | APGD-DLR
1            | 87               | 79.67 | 76.15
5            | 91.88            | 81.67 | 77.65
10           | 92.26            | 82.55 | 79

The APGD-DLR corresponds to the APGD attack using a different loss, i.e., the difference of logits ratio [90]. DLR: difference of logits ratio.
Unsupervised learning via Bayesian tensor decompositions
In this section, we present some recent advances of Bayesian tensor decompositions in two unsupervised learning applications: social group clustering and image completion. The first application adopts the Bayesian tensor CPD [11], [14], while the second one employs the Bayesian tensor TTD [49], [50], [51]. Exploiting the GSM-based sparsity-promoting prior, as introduced in the "Sparsity-Aware Modeling for Tensor Decompositions" section, and the effective MF-VI inference, as introduced in the "Inference Algorithms for Bayesian Tensor Decompositions" section, the resulting algorithms offer the nice features of bypassing the need for hyperparameter tuning and of dealing with overfitting.

Bayesian tensor CPD for social group clustering
In contrast to matrix decompositions, which are not unique, in general (unless certain constraints are imposed), the tensor CPD is provably unique under mild conditions [43]. This appealing property has made the CPD an important tool for extracting the underlying signals/patterns from observed data. The interpretability of the CPD model can be further enhanced by incorporating some side information, e.g., nonnegativeness [11], into model learning. Here, using the Enron Corpus e-mail dataset (a 3D tensor of size 184 × 184 × 44), we demonstrate how the Bayesian tensor CPD (with nonnegative factor matrices) [11] can be used to simultaneously determine the number of social groups, cluster people into different groups, and extract interpretable temporal profiles of the different social groups.

The considered dataset records the number of e-mail exchanges among 184 people within 44 months. In particular, each entry is the number of e-mails exchanged between two people within a certain month. The physical meanings of the three tensor dimensions are 1) the people who sent e-mails, 2) the people who received e-mails, and 3) the months, respectively. After applying the Bayesian tensor CPD, the automatically determined tensor rank can be interpreted as the number of underlying social groups. For the first two factor matrices, the physical meaning of each element is that it quantifies the "score" that a particular person belongs to a particular e-mail sending and receiving group, respectively. For the third factor matrix, each column corresponds to the temporal profile of the associated social group; see the discussion in [11].

Typically, we set the initial number of social groups (i.e., the tensor rank) large (e.g., the minimal dimension of the tensor data, 44, in the preceding example) and then run the Bayesian learning algorithm [11] to automatically determine the number of social groups that best interprets the data. As seen in Figure 15, the estimated number of social groups gradually reduces to the value four, indicating four underlying social groups. This is consistent with the results published in [44] and [91], which were obtained via trial-and-error experiments. The clustering results can be read from the first factor matrix, which is of size 184 × 4. Specifically, for each column, only a few elements have nonzero values, and they can be used to identify the significance of the corresponding people in this social group. After sorting the scores of each column in the first factor matrix, the people with the top 10 scores in each social group are shown in [11, Table 5]. The clustering results are well interpretable, as illustrated in Figure 15. For example, the people in the first group work either in the legal department or as lawyers and thus are clustered together. Moreover, interesting temporal patterns can be observed from the third factor matrix. It is clear that when the company has important events, such as a change of the chief executive officer, crises, and bankruptcy, distinct peaks appear (this information can be acquired

via checking the e-mail content and related research works; see, e.g., [44] and [91]). The e-mail data analysis results have showcased the appealing advantage of Bayesian SAL in the context of the tensor CPD, that is, the automatic determination of the social group number. This is important since it leads to interpretable results about group members, and temporal profiles can be naturally obtained.

Bayesian tensor TTD for image completion
Color images are naturally 3D tensors (with two spatial dimensions and one red–green–blue dimension). To fully exploit the inherent structures of images, recent image completion works [92], [93], [94] usually fold an image into a higher dimensional tensor (e.g., a 9D tensor) and then apply tensor decompositions to recover the missing pixels, among which the TTD is one of the most important tools due to its excellent performance. The folding operation is called tensor augmentation; see the details in, e.g., [92], [93], and [94]. For a P-D tensor, the TTD has P − 1 hyperparameters (called TT ranks). Manually tuning different combinations of these hyperparameters for overfitting avoidance is time-consuming. To facilitate this process, recent advances (see, e.g., [49], [50], and [51]) first assume large values for the TT ranks, employ the sparsity-promoting GSM prior, and use VI for effective Bayesian SAL. The resulting algorithms can automatically learn the most suitable TT ranks to match the underlying data pattern [49], [50], [51].

As an illustration, we consider the image completion of five images. Each image has size 256 × 256 × 3, and 80% of its pixels are randomly removed. After tensor augmentation, the images are folded into a 9D tensor of size 16 × 4 × 4 × 4 × 4 × 4 × 4 × 4 × 3. We assume that the initial TT ranks are set as large as 60 and apply the Bayesian TTD algorithm [51] to complete the missing pixels. In comparison, we present the image completion results from other recent TTD algorithms: TTC-TV [92], TMAC-TT [93], and STTO [94], with the suggested hyperparameter settings in
[Figure 15 summarizes the Bayesian tensor CPD pipeline on the Enron e-mail corpus: a curve of the social group number estimates versus the iteration number, starting from an initial guess of 44 groups and automatically settling at four groups that the Bayesian learning algorithm finds capable of interpreting the data; the clustering results, with the four groups labeled Government, Legal Affairs, Trading/Top Executive, and Pipeline (the name list was given in [11, Table 5]); and the interpretable temporal profiles over the months, with distinct peaks around the change of the CEO, the crisis breaks/investigations, and the bankruptcy.]

FIGURE 15. The Bayesian tensor CPD for social group clustering. CEO: chief executive officer.

their papers. The widely used metrics, e.g., the peak SNRs (PSNRs), were reported in [51, Table 2], from which it can be concluded that the Bayesian TTD algorithm achieves the best overall performance. In particular, in most cases, the Bayesian TTD algorithm recovers images with a 1–5-dB higher PSNR than the other algorithms. This is visually evident in the recovered images in Figure 16. In this example, the Bayesian SAL-based tensor TTD [51] gets rid of the costly hyperparameter tuning process for balancing the tradeoff between data fitting and noise overfitting; it directly learns these hyperparameters from the observations and shows excellent image restoration performance.

The following suggestions are provided for the real implementations of Bayesian tensor decomposition algorithms:
1) Initialization: To prevent the algorithm from being trapped in poor local minima, the initial factor matrix is usually set equal to the singular value decomposition approximation of the matrix that is obtained by unfolding the tensor data along a specific dimension; see, e.g., [10], [11], [14], and [51].
2) On-the-fly pruning: To accelerate the learning process while not affecting the convergence, in each iteration, if some of the columns in the factor matrices are found to be indistinguishable from an all-zeros column vector, they can be safely pruned (see the sketch after this list); see also, e.g., the discussion in [10] and [11].
3) Robustness against strong noise: When the corrupting noise sources are of large power, it was shown in [10] that slowing down the noise precision learning can increase the robustness of the algorithm.
Demonstration codes of the Bayesian tensor CPD and TTD algorithms are online at https://github.com/leicheng-tensor?tab=repositories.
Concluding remarks and future research
In this article, we have presented an overview of some state-of-the-art sparsity-promoting priors for both Bayesian linear and nonlinear modeling as well as for parametric and nonparametric models. In particular, these priors were incorporated into three advanced data analysis tools, namely, GP models, Bayesian DNNs, and tensor decomposition models, that can be applied to a wide spectrum of signal processing and machine learning applications. Commonly used inference algorithms for estimating the associated hyperparameters and the (approximate) posterior distributions were also discussed.

To demonstrate the effectiveness of the considered advanced sparsity-promoting models, we have carefully selected four important use cases, namely, time series prediction via GP regression, adversarial learning via Bayesian DNNs, and social group clustering and image completion using tensor decompositions. The reported results indicate that 1) sparsity-promoting priors are able to adapt themselves to the given data and enable automatic model structure selection; 2) the resulting sparse solution can better reveal the underlying (physical) characteristics of a target system/signal with only a few effective components; 3) sparsity-promoting priors, acting as the counterparts of the regularizers in optimization-based methods, can effectively help to avoid data overfitting, especially when the data size is relatively small; and 4) sparsity-promoting priors lead to natural and more reasonable uncertainty quantification, which is hard to obtain via traditional deep learning models.

Despite the rapid development of Bayesian learning and the enumerated advantages of sparsity-promoting models, such models are confronted with some challenges. Some open research directions are summarized as follows:
■ Quality of the posterior/predictive distribution: As we mentioned before, a unique feature of Bayesian learning models lies in their posterior distribution, which can be used to generate a point prediction and, meanwhile, provide an uncertainty quantification of the point prediction. Various recent works [23], [95], [96] indicate that the quality of the posterior distribution that is derived via Bayesian DNNs and GP models can be significantly improved by using cold tempering:

p(\theta \mid X, y) \propto \big( p(y \mid X, \theta) \cdot p(\theta) \big)^{1/T}, \quad T < 1.  (66)

Two conjectured causes are the misspecification of the learning model and the careless adoption of an inadequate, unintentionally informative prior [95]. Deeper analysis of such behavior is highly demanded. It is of great value to verify, either analytically or experimentally, whether adopting sparsity-promoting priors can help to avoid the use of cold tempering. It is also interesting to investigate the generalization properties of sparsity-promoting Bayesian learning models.
■ More emerging applications in complex systems: We have witnessed various applications of Bayesian learning models, and they will surely continue to play an important role in large and complex systems, such as 6G wireless communication systems [25], [97] and autonomous systems [78], that are constantly facing rapidly changing environments and critical decision making. Sparsity-promoting models are flexible enough to adapt themselves (for instance, by nulling irrelevant basis kernels in the GP models) to changing data profiles and provide rather reliable uncertainty quantification with a small computational expense.
■ More and tighter interactions of the three data analysis tools: Each of the three data analysis tools (introduced in this article) has already tapped into the design of the other tools (see, e.g., DNN and GP [16], GP and tensor [98], and DNN and tensor [99]) to achieve performance enhancement by borrowing the strengths of the other tools. However, many of these works are not under the framework of Bayesian SAL and thus do not possess the associated comparative advantages. It is promising to investigate how to combine the strengths of the three popular models, especially under the Bayesian SAL umbrella, to tackle challenging tasks, such as nonlinear regression for multidimensional and even heterogeneous data with deep kernels.
■ Sparsity awareness in emerging learning paradigms: Recently, we have witnessed various new paradigms, including, for instance, federated learning, lifelong learning, meta learning, and so on. We strongly believe that by further encoding sparsity awareness through Bayesian sparse learning strategies, these emerging learning paradigms can further improve their learning efficiency and generalization capability over the learning models that were introduced in this article.


FIGURE 16. The experimental results for visual comparison on image completion with 80% missing data. (a) The ground truth images. (b) The images with missing values. The results from (c) Bayesian TTD, (d) TTC-TV, (e) TMAC-TT, and (f) STTO.


■ Sparsity awareness in emerging learning paradigms: Recently, we have witnessed various new paradigms, including, for instance, federated learning, lifelong learning, meta learning, and so on. We strongly believe that by further encoding sparsity awareness through Bayesian sparse learning strategies, these emerging learning paradigms can further improve their learning efficiency and generalization capability over the learning models that were introduced in this article.

Acknowledgment
The work of Lei Cheng was supported, in part, by the National Natural Science Foundation of China (NSFC), under grant 62001309; Science and Technology on Sonar Laboratory, under grant 6142109KF212204; the 100-Talents Start-Up Fund, College of Information Science and Electronic Engineering, Zhejiang University; and the Zhejiang Provincial Key Laboratory of Information Processing, Communication, and Networking. The work of Feng Yin was supported, in part, by Guangdong Zhujiang Project 2017ZT07X152 and the NSFC, under grant 62271433. The work of Tsung-Hui Chang was supported by the Shenzhen Science and Technology Program, under grant JCYJ20190813171003723; the NSFC, under grants 62071409 and 61731018; and the Guangdong Provincial Key Laboratory of Big Data Computing. The work of Sotirios Chatzis received funding from the European Union's Horizon 2020 research and innovation program, under grant 872139 (Artificial Intelligence for the Deaf project). Feng Yin is the corresponding author. Feng Yin and Lei Cheng contributed equally to this work.

Authors
Lei Cheng ([email protected]) received his B.Eng. degree from Zhejiang University in 2013, and his Ph.D. degree from the University of Hong Kong in 2018. He is an assistant professor (ZJU Young Professor) in the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310058, China. He was a research scientist at SRIBD, The Chinese University of Hong Kong, Shenzhen, from 2018 to 2021. His research interests include Bayesian machine learning for tensor data analytics and interpretable machine learning for information systems.
Feng Yin ([email protected]) received his B.Sc. degree from Shanghai Jiao Tong University, China, and his M.Sc. and Ph.D. degrees from Technische Universitaet Darmstadt, Germany. He was a postdoctoral researcher with Ericsson Research, Linkoping, Sweden, and is now an assistant professor with The Chinese University of Hong Kong, Shenzhen 518100, China. He was a recipient of the Marie Curie Young Scholarship in 2014. He is an associate editor of the Elsevier journal Signal Processing. His research interests include statistical signal processing, Bayesian learning and optimization, and sensor fusion.
Sergios Theodoridis ([email protected]) received his Ph.D. degree from the Department of Electronics and Electrical Engineering, University of Birmingham, U.K., in 1978. He is currently a professor emeritus in the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens 15784, Greece, and a distinguished professor at Aalborg University, Aalborg 9220, Denmark. His research interests lie in the cross section of signal processing and machine learning. He is the author of the book Machine Learning: A Bayesian and Optimization Perspective, second edition (Academic Press, 2020), the coauthor of the best-selling book Pattern Recognition, fourth edition (Academic Press, 2009), and the coauthor of the book Introduction to Pattern Recognition: A MATLAB Approach (Academic Press, 2010). He is the coauthor of seven papers that have received Best Paper Awards, including the 2014 IEEE Signal Processing Magazine Best Paper Award and the 2009 IEEE Computational Intelligence Society Transactions on Neural Networks Outstanding Paper Award. He is the recipient of the 2021 IEEE SP Society Norbert Wiener Award, the 2017 EURASIP Athanasios Papoulis Award, the 2014 IEEE Signal Processing Society Carl Friedrich Gauss Education Award, and the 2014 EURASIP Meritorious Service Award. He has served as the vice president of the IEEE Signal Processing Society, as editor-in-chief of IEEE Transactions on Signal Processing, and as president of the European Association for Signal Processing (EURASIP). He currently serves as the chair of the IEEE SPS Awards Committee. He is a Fellow of EURASIP and a Life Fellow of IEEE.
Sotirios Chatzis ([email protected]) received his M.Eng. degree in electrical and computer engineering and his Ph.D. degree in machine learning from the National Technical University of Athens in 2005 and 2008, respectively. He is currently an associate professor in the Department of Electrical Engineering, Computer Engineering, and Informatics at the Cyprus University of Technology, Limassol 3036, Cyprus, and serves as the elected department chair. He currently serves as PI of several research projects funded by the European Commission and the Cyprus Research and Innovation Foundation. His research interests lie in the field of Bayesian deep learning. Characteristic application areas include recommendation systems, natural language understanding, video understanding, inference from financial data, as well as unbiasedness, exploitability, and trustworthiness in the era of machine learning.
Tsung-Hui Chang ([email protected]) received his B.S. degree in electrical engineering and his Ph.D. degree in communications engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2003 and 2008, respectively. He is an associate professor at the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, and the Shenzhen Research Institute of Big Data, Shenzhen 518100, China. He is an elected member of the IEEE SPS SPCOM TC and the founding chair of the IEEE SPS ISAC TWG. He received the IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award in 2015 and the IEEE SPS Best Paper Award in 2018 and 2021. He is currently a senior area editor of IEEE Transactions on Signal Processing and an associate editor of IEEE Open Journal of Signal Processing. His research interests lie in optimization problems in data communications and machine learning.


References
[1] S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, 2nd ed. San Diego, CA, USA: Academic, 2020.
[2] K. Panousis, S. Chatzis, and S. Theodoridis, "Nonparametric Bayesian deep networks with local competition," in Proc. Int. Conf. Mach. Learn. (ICML), 2019, pp. 4980–4988.
[3] K. Panousis, S. Chatzis, A. Alexos, and S. Theodoridis, "Local competition and stochasticity for adversarial robustness in deep learning," in Proc. Int. Conf. Artif. Intell. Statist. (AISTAT), 2021, vol. 130, pp. 3862–3870.
[4] K. Panousis, S. Chatzis, and S. Theodoridis, "Stochastic local winner-takes-all networks enable profound adversarial robustness," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021.
[5] C. Louizos, K. Ullrich, and M. Welling, "Bayesian compression for deep learning," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 3288–3298.
[6] S. Ghosh, J. Yao, and F. Doshi-Velez, "Model selection in Bayesian neural networks via horseshoe priors," J. Mach. Learn. Res., vol. 20, no. 182, pp. 1–46, Oct. 2019.
[7] F. Yin, L. Pan, T. Chen, S. Theodoridis, and Z.-Q. Luo, "Linear multiple low-rank kernel based stationary Gaussian processes regression for time series," IEEE Trans. Signal Process., vol. 68, pp. 5260–5275, Sep. 2020, doi: 10.1109/TSP.2020.3023008.
[8] T. Paananen, J. Piironen, M. R. Andersen, and A. Vehtari, "Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution," in Proc. Int. Conf. Artif. Intell. Statist. (AISTAT), 2019, pp. 1743–1752.
[9] H. Kim and Y. W. Teh, "Scaling up the automatic statistician: Scalable structure discovery using Gaussian processes," in Proc. Int. Conf. Artif. Intell. Statist. (AISTAT), Lanzarote, Spain, 2018, vol. 84, pp. 575–584.
[10] L. Cheng, Z. Chen, Q. Shi, Y.-C. Wu, and S. Theodoridis, "Towards flexible sparsity-aware modeling: Automatic tensor rank learning using the generalized hyperbolic prior," IEEE Trans. Signal Process., vol. 70, no. 1, pp. 1834–1849, Apr. 2022, doi: 10.1109/TSP.2022.3164200.
[11] L. Cheng, X. Tong, S. Wang, Y.-C. Wu, and H. V. Poor, "Learning nonnegative factors from tensor data: Probabilistic modeling and inference algorithm," IEEE Trans. Signal Process., vol. 68, pp. 1792–1806, Feb. 2020, doi: 10.1109/TSP.2020.2975353.
[12] Y. Zhou and Y.-M. Cheung, "Bayesian low-tubal-rank robust tensor factorization with multi-rank determination," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 62–76, Jan. 2021, doi: 10.1109/TPAMI.2019.2923240.
[13] L. Cheng, Y.-C. Wu, and H. V. Poor, "Probabilistic tensor canonical polyadic decomposition with orthogonal factors," IEEE Trans. Signal Process., vol. 65, no. 3, pp. 663–676, Feb. 2017, doi: 10.1109/TSP.2016.2603969.
[14] Q. Zhao, L. Zhang, and A. Cichocki, "Bayesian CP factorization of incomplete tensors with automatic rank determination," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1751–1763, Jan. 2015, doi: 10.1109/TPAMI.2015.2392756.
[15] Z. Zhang and C. Hawkins, "Variational Bayesian inference for robust streaming tensor factorization and completion," in Proc. IEEE Int. Conf. Data Mining (ICDM), Singapore, 2018, pp. 1446–1451, doi: 10.1109/ICDM.2018.00200.
[16] Y. Dai, T. Zhang, Z. Lin, F. Yin, S. Theodoridis, and S. Cui, "An interpretable and sample efficient deep kernel for Gaussian process," in Proc. Int. Conf. Uncertainty Artif. Intell. (UAI), 2020, pp. 759–768.
[17] L. Cheng, Y.-C. Wu, and H. V. Poor, "Scaling probabilistic tensor canonical polyadic decomposition to massive data," IEEE Trans. Signal Process., vol. 66, no. 21, pp. 5534–5548, Nov. 2018, doi: 10.1109/TSP.2018.2865407.
[18] L. Cheng and Q. Shi, "Towards overfitting avoidance: Tuning-free tensor-aided multi-user channel estimation for 3D massive MIMO communications," IEEE J. Sel. Topics Signal Process., vol. 15, no. 3, pp. 832–846, Apr. 2021, doi: 10.1109/JSTSP.2021.3058019.
[19] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[20] S. F. Gull, "Bayesian inductive inference and maximum entropy," in Maximum-Entropy and Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith, Eds. Dordrecht, The Netherlands: Springer-Verlag, 1988, pp. 53–74.
[21] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6, no. 2, pp. 461–464, Jul. 1978, doi: 10.1214/aos/1176344136.
[22] D. J. MacKay, "Bayesian interpolation," Neural Comput., vol. 4, no. 3, pp. 415–447, May 1992, doi: 10.1162/neco.1992.4.3.415.
[23] A. G. Wilson and P. Izmailov, "Bayesian deep learning and a probabilistic perspective of generalization," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 1–12.
[24] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[25] Y. Xu, F. Yin, W. Xu, J. Lin, and S. Cui, "Wireless traffic prediction with scalable Gaussian process: Framework, algorithms, and verification," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1291–1306, Jun. 2019, doi: 10.1109/JSAC.2019.2904330.
[26] J. Lee, J. Sohl-Dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri, "Deep neural networks as Gaussian processes," in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018.
[27] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[28] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. New York, NY, USA: Springer Science & Business Media, 2010.
[29] D. F. Andrews and C. L. Mallows, "Scale mixtures of normal distributions," J. Roy. Statist. Soc., B (Methodol.), vol. 36, no. 1, pp. 99–102, Sep. 1974, doi: 10.1111/j.2517-6161.1974.tb00989.x.
[30] D. P. Wipf and B. D. Rao, "Sparse Bayesian learning for basis selection," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2153–2164, Aug. 2004, doi: 10.1109/TSP.2004.831016.
[31] T. L. Griffiths and Z. Ghahramani, "The Indian buffet process: An introduction and review," J. Mach. Learn. Res., vol. 12, no. 32, pp. 1185–1224, Jan. 2011. [Online]. Available: http://jmlr.org/papers/v12/griffiths11a.html
[32] D. J. MacKay, "Probable networks and plausible predictions — A review of practical Bayesian methods for supervised neural networks," Netw., Comput. Neural Syst., vol. 6, no. 3, pp. 469–505, Feb. 1995, doi: 10.1088/0954-898X_6_3_011.
[33] J. Bridle, "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 1990, vol. 2, pp. 211–217.
[34] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, "Structure discovery in nonparametric regression through compositional kernel search," in Proc. Int. Conf. Mach. Learn. (ICML), Atlanta, GA, USA, 2013, pp. 1166–1174.
[35] T. Zhang, F. Yin, and Z.-Q. Luo, "Fast generic interaction detection for model interpretability and compression," in Proc. Int. Conf. Learn. Representations (ICLR), 2022, pp. 1–29.
[36] M. Lázaro-Gredilla, J. Quiñonero Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, "Sparse spectrum Gaussian process regression," J. Mach. Learn. Res., vol. 11, pp. 1865–1881, Aug. 2010.
[37] A. Wilson and R. P. Adams, "Gaussian process kernels for pattern discovery and extrapolation," in Proc. Int. Conf. Mach. Learn. (ICML), Atlanta, GA, USA, 2013, pp. 1067–1075.
[38] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[39] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Proc. Int. Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2007, pp. 1177–1184.
[40] M. W. Seeger, "Bayesian inference and optimal design for the sparse linear model," J. Mach. Learn. Res., vol. 9, pp. 759–813, Jun. 2008.
[41] M. E. Tipping and A. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," in Proc. Int. Conf. Artif. Intell. Statist. (AISTAT), Key West, FL, USA, 2003, pp. 3–6.
[42] N. I. Achieser, Theory of Approximation. New York, NY, USA: Dover, 1992.
[43] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Trans. Signal Process., vol. 65, no. 13, pp. 3551–3582, Jul. 2017, doi: 10.1109/TSP.2017.2690524.
[44] E. E. Papalexakis, N. D. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Trans. Signal Process., vol. 61, no. 2, pp. 493–506, Jan. 2013, doi: 10.1109/TSP.2012.2225052.
[45] H. Becker, L. Albera, P. Comon, R. Gribonval, F. Wendling, and I. Merlet, "Brain-source imaging: From sparse to tensor models," IEEE Signal Process. Mag., vol. 32, no. 6, pp. 100–112, Nov. 2015, doi: 10.1109/MSP.2015.2413711.
[46] C. Chatzichristos, E. Kofidis, M. Morante, and S. Theodoridis, "Blind fMRI source unmixing via higher-order tensor decompositions," J. Neurosci. Methods, vol. 315, pp. 17–47, Mar. 2019, doi: 10.1016/j.jneumeth.2018.12.007.
[47] H. Chen, F. Ahmad, S. Vorobyov, and F. Porikli, "Tensor decompositions in wireless communications and MIMO radar," IEEE J. Sel. Topics Signal Process., vol. 15, no. 3, pp. 438–453, Apr. 2021, doi: 10.1109/JSTSP.2021.3061937.


[48] Q. Zhao, L. Zhang, and A. Cichocki, "Bayesian sparse Tucker models for dimension reduction and tensor completion," 2015. [Online]. Available: https://arXiv.org/abs/1505.02343
[49] C. Hawkins and Z. Zhang, "Bayesian tensorized neural networks with automatic rank selection," Neurocomputing, vol. 453, no. C, pp. 172–180, Sep. 2021, doi: 10.1016/j.neucom.2021.04.117.
[50] L. Xu, L. Cheng, N. Wong, and Y.-C. Wu, "Probabilistic tensor train decomposition with automatic rank determination from noisy data," in Proc. 2021 IEEE Statist. Signal Process. Workshop (SSP), pp. 461–465, doi: 10.1109/SSP49050.2021.9513808.
[51] L. Xu, L. Cheng, N. Wong, and Y.-C. Wu, "Overfitting avoidance in tensor train factorization and completion: Prior analysis and inference," in Proc. 2021 IEEE Int. Conf. Data Mining (ICDM), pp. 1439–1444, doi: 10.1109/ICDM51629.2021.00185.
[52] Y. Panagakis, J. Kossaifi, G. G. Chrysos, J. Oldfield, M. A. Nicolaou, A. Anandkumar, and S. Zafeiriou, "Tensor methods in computer vision and deep learning," Proc. IEEE, vol. 109, no. 5, pp. 863–890, Apr. 2021, doi: 10.1109/JPROC.2021.3074329.
[53] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[54] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM J. Optim., vol. 23, no. 2, pp. 1126–1153, Jun. 2013, doi: 10.1137/120891009.
[55] M. Hong, Z.-Q. Luo, and M. Razaviyayn, "Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems," SIAM J. Optim., vol. 26, no. 1, pp. 337–364, Jan. 2016, doi: 10.1137/140990309.
[56] Y. Sun, P. Babu, and D. P. Palomar, "Majorization-minimization algorithms in signal processing, communications, and machine learning," IEEE Trans. Signal Process., vol. 65, no. 3, pp. 794–816, Feb. 2017, doi: 10.1109/TSP.2016.2601299.
[57] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, "Distributed learning in the nonconvex world: From batch data to streaming and beyond," IEEE Signal Process. Mag., vol. 37, no. 3, pp. 26–38, May 2020, doi: 10.1109/MSP.2020.2970170.
[58] G. Parisi and R. Shankar, Statistical Field Theory. Boulder, CO, USA: Westview, 1988.
[59] C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt, "Advances in variational inference," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 2008–2026, Aug. 2019, doi: 10.1109/TPAMI.2018.2889774.
[60] R. Zhang, C. Li, J. Zhang, C. Chen, and A. G. Wilson, "Cyclical stochastic gradient MCMC for Bayesian deep learning," in Proc. Int. Conf. Learn. Representations (ICLR), 2020.
[61] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. Int. Conf. Comput. Statist. (COMPSTAT). Paris, France: Springer-Verlag, 2010, pp. 177–186.
[62] G. Lan, First-Order and Stochastic Optimization Methods for Machine Learning. Cham: Springer Nature Switzerland AG, 2020.
[63] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[64] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011, doi: 10.1561/2200000016.
[65] O. Shayer, D. Levi, and E. Fetaya, "Learning discrete weights using the local reparameterization trick," in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018.
[66] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Representations, 2014.
[67] C. J. Maddison, A. Mnih, and Y. W. Teh, "The concrete distribution: A continuous relaxation of discrete random variables," in Proc. Int. Conf. Learn. Representations, 2017.
[68] P. Kumaraswamy, "A generalized probability density function for double-bounded random processes," J. Hydrol., vol. 46, nos. 1–2, pp. 79–88, Mar. 1980, doi: 10.1016/0022-1694(80)90036-0.
[69] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[70] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. Berlin, Heidelberg: Springer-Verlag, 1986.
[71] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2021, vol. 35, no. 12, pp. 11,106–11,115.
[72] Y. Xu, F. Yin, W. Xu, C.-H. Lee, J. Lin, and S. Cui, "Scalable learning paradigms for data-driven wireless communication," IEEE Commun. Mag., vol. 58, no. 10, pp. 81–87, 2020, doi: 10.1109/MCOM.001.2000143.
[73] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), 2017, pp. 6405–6416.
[74] A. Zoubir and B. Boashash, "The bootstrap and its application in signal processing," IEEE Signal Process. Mag., vol. 15, no. 1, pp. 56–76, Jan. 1998, doi: 10.1109/79.647043.
[75] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proc. Int. Conf. Learn. Representations (ICLR), 2015.
[76] N. Carlini and D. A. Wagner, "Towards evaluating the robustness of neural networks," in Proc. IEEE Symp. Security Privacy. San Jose, CA, USA: IEEE Computer Society, May 2017, pp. 39–57.
[77] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," 2017, arXiv:1706.06083.
[78] R. McAllister, Y. Gal, A. Kendall, M. van der Wilk, A. Shah, R. Cipolla, and A. Weller, "Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2017, pp. 4745–4753.
[79] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018.
[80] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 2107–2116.
[81] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille, "Mitigating adversarial effects through randomization," in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018.
[82] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, "Deflecting adversarial attacks with pixel deflection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, 2018, pp. 8571–8580.
[83] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar, "Stochastic activation pruning for robust adversarial defense," in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018.
[84] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2016.
[85] B. Wu, J. Chen, D. Cai, X. He, and Q. Gu, "Do wider neural networks really help adversarial robustness?" in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 7054–7067.
[86] T. Pang, X. Yang, Y. Dong, K. Xu, J. Zhu, and H. Su, "Boosting adversarial training with hypersphere embedding," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 7779–7792.
[87] D. Hendrycks, K. Lee, and M. Mazeika, "Using pre-training can improve model robustness and uncertainty," in Proc. Int. Conf. Mach. Learn. (ICML), 2019, pp. 2712–2721.
[88] S. Gowal, C. Qin, J. Uesato, T. Mann, and P. Kohli, "Uncovering the limits of adversarial training against norm-bounded adversarial examples," 2021, arXiv:2010.03593.
[89] A. Athalye, N. Carlini, and D. A. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 274–283.
[90] F. Croce and M. Hein, "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks," in Proc. Int. Conf. Mach. Learn. (ICML), 2020, pp. 2206–2216.
[91] B. W. Bader, T. G. Kolda, and R. A. Harshman, "Temporal analysis of social networks using three-way DEDICOM," Sandia National Lab. (SNL-NM), Albuquerque, NM, USA, Tech. Rep. SAND2006-2161, 2006.
[92] C.-Y. Ko, K. Batselier, L. Daniel, W. Yu, and N. Wong, "Fast and accurate tensor completion with total variation regularized tensor trains," IEEE Trans. Image Process., vol. 29, pp. 6918–6931, May 2020, doi: 10.1109/TIP.2020.2995061.
[93] J. A. Bengua, H. N. Phien, H. D. Tuan, and M. N. Do, "Efficient tensor completion for color image and video recovery: Low-rank tensor train," IEEE Trans. Image Process., vol. 26, no. 5, pp. 2466–2479, May 2017, doi: 10.1109/TIP.2017.2672439.
[94] L. Yuan, Q. Zhao, and J. Cao, "High-order tensor completion for data recovery via sparse tensor-train optimization," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2018, pp. 1258–1262, doi: 10.1109/ICASSP.2018.8462592.
[95] F. Wenzel et al., "How good is the Bayes posterior in deep neural networks really?" in Proc. Int. Conf. Mach. Learn. (ICML), 2020, vol. 119, pp. 10,248–10,259.
[96] B. Adlam, J. Snoek, and S. L. Smith, "Cold posteriors and aleatoric uncertainty," 2020. [Online]. Available: https://arXiv.org/abs/2008.00029
[97] K. Chen, Q. Kong, Y. Dai, Y. Xu, F. Yin, L. Xu, and S. Cui, "Recent advances in data-driven wireless communication using Gaussian processes: A comprehensive survey," China Commun., vol. 19, no. 1, pp. 218–237, Jan. 2022, doi: 10.23919/JCC.2022.01.016.
[98] P. Izmailov, A. Novikov, and D. Kropotov, "Scalable Gaussian processes with billions of inducing inputs via tensor train decomposition," in Proc. Int. Conf. Artif. Intell. Statist. (AISTAT). Playa Blanca, Spain: PMLR, 2018, pp. 726–735.
[99] A. Tjandra, S. Sakti, and S. Nakamura, "Compressing recurrent neural network with tensor train," in Proc. Int. Joint Conf. Neural Netw. (IJCNN). Anchorage, AK, USA: IEEE, 2017, pp. 4451–4458.


Daniel Romero and Seung-Jun Kim

Radio Map Estimation

A data-driven approach to spectrum cartography

Digital Object Identifier 10.1109/MSP.2022.3200175
Date of current version: 27 October 2022

Radio maps characterize quantities of interest in radio communication environments, such as the received signal strength and channel attenuation, at every point of a geographical region. Radio map estimation (RME) typically entails interpolative inference based on spatially distributed measurements. In this tutorial article, after presenting some representative applications of radio maps, the most prominent RME methods are discussed. Starting from simple regression, the exposition gradually delves into more sophisticated algorithms, eventually touching upon state-of-the-art techniques. To gain insight into this versatile toolkit, illustrative toy examples will also be presented.

Introduction
Spectrum cartography comprises a collection of techniques used to construct and maintain radio maps, which provide useful information on the radio-frequency (RF) landscape, such as the received signal power, interference power, power spectral density (PSD), electromagnetic absorption, and channel gain across a geographic area; see, e.g., [1], [2], and [3]. A quick overview of the most representative types of radio maps is provided in Table 1.
Radio maps find a myriad of applications in wireless communications and networking, such as network planning, interference coordination and mitigation, power control, resource allocation, handoff management, multihop routing, dynamic spectrum access, and cognitive radio networking tasks; see [4] and [5] and the references therein. Radio maps are also useful for localization [2] and tomography [6].


Arguably, spectrum cartography can be traced back to the application of Maxwell's equations to characterize the propagation of radio waves across space. However, due to insufficient computational capacity, this approach has been traditionally confined to problems involving relatively simple geometries, such as determining the electromagnetic field radiated by a dipole. To analyze more complex environments, numerous empirical models have been developed, such as the well-known P recommendations from the International Telecommunication Union–Radiocommunication Sector. Unfortunately, this kind of model often fails to provide estimates that are accurate enough for a given application [7].
With the advent of modern computational resources, finite-element analysis and ray-tracing techniques paved the way for effectively approximating the solutions of Maxwell's equations in complex environments. However, besides their high computational complexity, their main limitation is that an accurate description of the propagation environment is required through 3D models of all objects and obstacles along with their electromagnetic properties.
To mitigate such limitations, RME was proposed, originally in the context of cognitive radios [1]. In RME, a collection of measurements acquired by spatially distributed sensors is used together with their locations to construct a map of the relevant RF descriptors, typically by applying some form of interpolation. As this approach does not require physical modeling of the propagation environment, it constitutes a data-driven alternative to the model-based techniques mentioned earlier. Since its conception, a sizable body of literature has emerged on the estimation of a variety of kinds of radio maps for a wide range of application scenarios; see, e.g., [2], [4], [5], [8], and [9] as well as the references therein. Recently, the work in this area has intensified thanks to the boom of deep learning [10], [11], [12], [13].
Table 1. The most prominent types of radio maps.

Coverage map
- Illustration in a 1D scenario: coverage (0 or 1) versus x (m).
- Example applications: find coverage holes.
- Construction: the base station only needs to know if the mobile user can receive data.
- The map changes if the environment changes, the transmission activity changes, or the transmitter location or orientation changes.

Outage probability map
- Illustration: outage probability versus x (m).
- Example applications: improve the reliability of a cellular network.

Power map
- Illustration: power (dBm) versus x (m).
- Example applications: unveil regions of high interference; determine appropriate locations for new base stations.
- Construction: the mobile user reports power measurements.

PSD map
- Illustration: PSD over frequency (MHz) and x (m).
- Example applications: maximize frequency reuse.
- Construction: the mobile user reports power (density) measurements for each frequency (e.g., a periodogram).

Channel gain map
- Illustration: gain over transmitter position x1 (m) and receiver position x2 (m).
- Example applications: resource allocation for device-to-device communications.
- Construction: the user at position x1 sends a pilot sequence; the user at x2 sends an estimate of the received power to the base station after normalizing by the transmitted power.
- The map changes if the environment changes.

Although radio maps find applications in many domains, this table exemplifies their applicability in cellular communications for specificity. The x-coordinate indexes a point on a road or railway.


This article provides an introduction to RME by guiding the reader on the foundations and applications of RME as well as on recent advances in this rapidly growing research area. To this end, the most common types of radio maps are first described. Afterward, RME methods for signal strength and propagation maps are expounded in a tutorial fashion. Practical considerations and future directions are also discussed.

Radio maps and their applications
The signal received at a certain location is determined by 1) the transmitted signal and 2) the communication channel between the transmitter and the receiver. Depending on whether the focus is on the combined effect of the two or, rather, on the effect of the propagation channel itself, two families of radio maps can be considered: signal strength maps and propagation maps.
For simplicity, unless stated otherwise, it will be assumed that the maps do not change significantly within the time interval under consideration. In practice, the length of the interval for which this assumption remains valid depends not only on the speed of variation but also on the specific application.

Signal strength maps
Signal strength maps focus on metrics of the received signal, which are determined by the aggregate effects of the channel upon the signals transmitted by all active sources. This is the case, for instance, if the goal is to map interference power levels. Constructing such maps does not require knowledge of the number, locations, and power of the transmitters, which is appealing in scenarios involving a large number of mobile transmitters, as in device-to-device communications or cellular uplink channels. Different kinds of signal strength maps are presented next by the increasing level of detail they capture.

Coverage maps
The coarsest characterization of the radio environment can be provided by a map that takes only binary values for coverage indication. Specifically, let $p(x)$ denote the signal power that a radio with an isotropic antenna (the case of nonisotropic antenna patterns is discussed later) receives at a spatial location $x \in \mathcal{X}$, where $\mathcal{X}$ represents a geographical region of interest, typically a subset of $\mathbb{R}$, $\mathbb{R}^2$, or $\mathbb{R}^3$. A coverage map is a function $s : \mathcal{X} \to \{0, 1\}$ that takes the value $s(x) = 1$ if $p(x) \geq \gamma$ and zero otherwise, where $\gamma$ is a given threshold. This threshold may correspond to the minimum signal power necessary to guarantee a prescribed communication rate. Coverage maps may also be constructed by replacing $p(x)$ in this definition with the signal-to-noise-power ratio or the signal-to-interference-plus-noise-power ratio.
Coverage maps are often used by cellular and TV broadcast network operators to find areas of weak coverage, which allows them to determine suitable sites for deploying new base stations and relay antennas. A more recent application is mission planning for autonomous mobile robots or vehicles that require network connectivity, where coverage maps may assist in, e.g., minimizing the time and distance traversed without connectivity.

Outage probability maps
A soft version of coverage maps can be constructed by adopting a probabilistic perspective, as the effects of the channel, such as fading and shadowing, are often modeled as random. An outage probability map $q(x)$ is a function $q : \mathcal{X} \to [0, 1]$ that provides the probability that $p(x) < \gamma$. Since outage probability maps capture more detailed information than coverage maps, the former can be readily employed in the applications of the latter. However, the additional information provided by outage probability maps allows more sophisticated decision making, as in route planning [14].

Power maps
A substantially finer characterization of the signal strength is obtained by a power map, defined as a function $p : \mathcal{X} \to \mathbb{R}$, which returns the received power $p(x)$ at every spatial location $x \in \mathcal{X}$. As the information contained in power maps is richer than that in coverage or outage probability maps, power maps can be used not only for tasks such as network planning and trajectory optimization but also for localizing transmitters [2]. Also, in fingerprint-based localization, a mobile device can measure the received powers of nearby access points and determine its own position by matching the measurements with the values of the map.

PSD maps
One is sometimes interested in the power distribution not only across space but also across the frequency domain. A PSD map is a function $p : \mathcal{X} \times \mathcal{F} \to \mathbb{R}$ that provides the PSD $p(x, f)$ of the received signal at each location $x \in \mathcal{X}$. Here, $f \in \mathcal{F}$ is the frequency variable and the set $\mathcal{F} \subset \mathbb{R}$ contains the frequencies of interest. If the latter is discretized as $\mathcal{F} = \{f_1, \ldots, f_{N_f}\}$, constructing a PSD map is tantamount to constructing a collection of power maps proportional to $p(x, f_1), \ldots, p(x, f_{N_f})$.
In addition to the applications mentioned for the previous kinds of signal strength maps, PSD maps enable additional use cases. For example, they can be used for speeding up handoff procedures in cellular networks by providing the quality of the relevant channels at a given location, obviating the need for time-consuming channel measurement or feedback processes. PSD maps can also be utilized for interference coordination, where concurrent transmissions are assigned to different frequency band channels based on the transceiver locations, promoting efficient spectrum reuse. In cognitive radio networks, PSD maps can unveil underutilized "white spaces" in the space/frequency/time domains, which can be exploited opportunistically by unlicensed users [15].
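To make these definitions concrete, the following minimal Python sketch builds coverage, outage probability, and discretized PSD maps from a toy 1D power map. The propagation model, transmitter position, threshold, shadowing level, and spectral weights are all illustrative assumptions of this sketch, not values taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D scenario mirroring Table 1: locations x on a 100-m road.
x = np.linspace(0.0, 100.0, 501)
x_tx, p_tx = 40.0, 1e-3                        # hypothetical transmitter (m, W)
p = p_tx / np.maximum((x - x_tx) ** 2, 1.0)    # mean received power map p(x)

# Coverage map: s(x) = 1 if p(x) >= gamma, else 0.
gamma = 5e-7                                   # assumed threshold (W)
s = (p >= gamma).astype(int)

# Outage probability map q(x) = Pr[p(x) < gamma], estimated by Monte Carlo
# under an assumed log-normal shadowing with an 8-dB standard deviation.
shadow_dB = rng.normal(0.0, 8.0, size=(2000, x.size))
p_draws = p * 10.0 ** (shadow_dB / 10.0)
q = (p_draws < gamma).mean(axis=0)

# A PSD map over a discretized frequency set reduces to a stack of power
# maps, one per frequency bin: psd[k, :] is the power map at f_k.
f = np.array([2.42e9, 2.44e9, 2.46e9])         # assumed frequency bins (Hz)
psd = np.vstack([p * w for w in (1.0, 0.5, 0.1)])  # toy spectral weights
```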


Propagation maps
Whereas signal strength maps capture the aggregate effect of the transmitted signals and the channels, propagation maps focus exclusively on the channel. Each parameter of interest gives rise to a different kind of propagation map. As described next, channel gain maps constitute the simplest kind. Suppose that $p^{\mathrm{RX}}$ denotes the power received at location $x^{\mathrm{RX}}$ due to a transmitter with power $p^{\mathrm{TX}}$ at location $x^{\mathrm{TX}}$. A channel gain map is a function $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ of the transmitter and receiver locations that provides the channel gain $h(x^{\mathrm{TX}}, x^{\mathrm{RX}}) = p^{\mathrm{RX}}/p^{\mathrm{TX}}$. More sophisticated propagation maps arise by accounting for frequency selectivity. For example, the power gain that each subcarrier sees in an orthogonal frequency division multiplexing system can be mapped. For simplicity, this article focuses on channel gain maps, which provide the overall gain that affects a single narrow frequency band.
Clearly, given a channel gain map $h(x^{\mathrm{TX}}, x^{\mathrm{RX}})$ together with the locations $(x_1^{\mathrm{TX}}, \ldots, x_S^{\mathrm{TX}})$ and transmit powers $p_1^{\mathrm{TX}}, \ldots, p_S^{\mathrm{TX}}$ of $S$ sources in a region, one can obtain the power map as $p(x) = \sum_s h(x_s^{\mathrm{TX}}, x)\, p_s^{\mathrm{TX}}$, provided that the signals transmitted by different sources are uncorrelated, as generally occurs in practice, except, e.g., in single-frequency networks, such as the ones utilized by digital television broadcasting. Thus, propagation maps can be readily used in the applications of signal strength maps provided that the locations and transmit powers of the sources are known. On the other hand, propagation maps offer more versatile information than signal strength maps: whereas a signal strength map may provide the total interference at each location, a propagation map reveals the contribution of each source. This enhanced flexibility is instrumental for tasks such as interference coordination or network planning.
Observe that changes in the locations and transmit powers of the sources give rise to changes in signal strength maps, whereas propagation maps remain unaffected. On the other hand, alterations in the scattering environment, such as the construction of new buildings or seasonal changes of foliage, affect both propagation and signal strength maps. Thus, the time scale of the variations of signal strength maps is never greater than that of propagation maps. Hence, propagation maps can be used to construct signal strength maps in highly dynamic setups, such as the uplinks of cellular networks, where mobile users rapidly change their positions and activity patterns.
Propagation maps can also help address the classic problem of predicting the potential interference inflicted to passive receivers, which arises in the context of cognitive radios [15]. For example, when reusing the TV spectrum, the challenge is to carry out unlicensed transmissions without introducing detrimental interference to TV receivers. With a propagation map, one can ensure that no receivers in a certain area will be negatively affected without the need to know their precise locations [16].
Yet another application is the problem of aerial base station placement, where a propagation map of the air-to-ground channel can be constructed to determine the best set of locations to deploy unmanned aerial vehicle (UAV)-mounted base stations to serve ground users [17].

Estimation of signal strength maps
In a typical RME formulation, the goal is to construct a radio map using a set of measurements acquired by spatially dispersed sensors together with their locations. For concreteness, consider $N$ measurements, where the $n$th measurement $m_n$ is acquired by a sensor at location $x_n$. In the case of power maps, $m_n$ may be the average power measured in a certain band within a given time interval, which can be modeled as $m_n = p(x_n) + z_n$. Here, $z_n$ denotes measurement noise, which is caused, e.g., by the finite length of the averaging time interval. For estimating PSD maps, $m_n$ can contain power spectrum measurements, such as periodograms. The RME problem becomes constructing the desired signal strength map given the pairs $\{(x_n, m_n)\}_{n=1}^N$.
It is worth noting that each sensor may collect measurements at multiple locations provided that they are taken within a time window whose length is small relative to the scale of variations of the target map. Thus, the number of sensors may be much smaller than $N$. In fact, the RME formulation can be extended to accommodate the decision on where to acquire the measurements sequentially, as discussed in the "Spectrum Surveying" section. Furthermore, a sensor need not be a special-purpose device. For example, a user terminal in a cellular network may function as a sensing device. The rest of the section presents the main approaches for constructing signal strength maps.

Estimation of power maps
This section introduces the main power map estimation methods, a comparison of which can be found in Table 2.

Linear parametric RME
Let us start from the simple yet illustrative scenario where there is a single transmitter with known location $x_1^{\mathrm{TX}}$ in free space. As per Friis's transmission equation, the received power at location $x$ is inversely proportional to the squared distance $\|x - x_1^{\mathrm{TX}}\|^2$. In other words, $p(x)$ can be written as $p(x) = \alpha_1 \psi_1(x)$, where $\psi_1(x) := 1/\|x - x_1^{\mathrm{TX}}\|^2$ and $\alpha_1$ depends on the (unknown) transmit power. Therefore, to estimate $p(x)$ everywhere, it suffices to obtain $\alpha_1$. Clearly, this could be accomplished from a single noiseless measurement $m_1 = p(x_1)$ at $x_1$ by setting $\alpha_1 = m_1/\psi_1(x_1)$.
Similarly, if $S$ transmitters with known locations $x_1^{\mathrm{TX}}, \ldots, x_S^{\mathrm{TX}}$ are active in a certain region, one can let $\psi_s(x) := 1/\|x - x_s^{\mathrm{TX}}\|^2$ to write $p(x)$ as

$$p(x) = \alpha_1 \psi_1(x) + \cdots + \alpha_S \psi_S(x) \tag{1}$$

so long as the transmitted waveforms are uncorrelated. Based on (1), one can typically estimate the $S$ coefficients $\{\alpha_s\}$ from $S$ noiseless measurements by solving the system of equations

$$m_1 = \alpha_1 \psi_1(x_1) + \cdots + \alpha_S \psi_S(x_1)$$
$$\vdots$$
$$m_S = \alpha_1 \psi_1(x_S) + \cdots + \alpha_S \psi_S(x_S). \tag{2}$$

In practice, however, the measurements are noisy, and one may use more than $S$ of them to estimate the coefficients. Upon defining $\alpha := [\alpha_1, \ldots, \alpha_S]^{\top}$, $m := [m_1, \ldots, m_N]^{\top}$, $(\Psi)_{n,s} := \psi_s(x_n)$, and $z := [z_1, \ldots, z_N]^{\top}$, (2) can be extended to the case with $N > S$ measurements as $m = \Psi\alpha + z$. The least-squares (LS) estimate of $\alpha$ is, therefore, $\hat{\alpha} = \arg\min_{\alpha} \|m - \Psi\alpha\|^2$. Because the number $S$ of parameters to be estimated does not depend on the number $N$ of measurements, this approach is termed parametric. Further parametric and nonparametric estimators are discussed in the rest of this section.
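A minimal Python sketch of this LS estimator on synthetic data follows; the transmitter locations, sensor positions, coefficients, and noise level are illustrative assumptions chosen for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: S = 2 transmitters with known 2D locations (illustrative).
x_tx = np.array([[20.0, 30.0], [70.0, 60.0]])   # transmitter locations (m)
alpha_true = np.array([5e-4, 2e-4])             # unknown coefficients

def psi(x):
    """Basis functions psi_s(x) = 1 / ||x - x_s^TX||^2, per Friis's law."""
    d2 = ((x[:, None, :] - x_tx[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / np.maximum(d2, 1.0)            # guard against division by zero

# N = 50 noisy measurements m = Psi @ alpha + z at random sensor locations.
x_n = rng.uniform(0.0, 100.0, size=(50, 2))
Psi = psi(x_n)
m = Psi @ alpha_true + 1e-8 * rng.standard_normal(50)

# LS estimate: alpha_hat = argmin_alpha ||m - Psi alpha||^2.
alpha_hat, *_ = np.linalg.lstsq(Psi, m, rcond=None)

# The power map estimate follows from (1) at any query locations.
x_query = rng.uniform(0.0, 100.0, size=(5, 2))
p_hat = psi(x_query) @ alpha_hat
```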


Figure 1 illustrates a setup where a map needs to be estimated on a line; i.e., the region of interest is given by $\mathcal{X} \subset \mathbb{R}$, which may correspond, e.g., to a road or a railway. The true map and the estimated map obtained by substituting $\hat{\alpha}$ into the right-hand side (RHS) of (1) are compared. The estimated map is seen to be reasonably accurate and can be shown to converge to the true map for $N \to \infty$ under mild conditions.

FIGURE 1. An example of map estimation in 1D using a parametric estimator that knows the transmitter locations. The estimate is reasonably accurate despite the low number of measurements.
Table 2. A comparison of the power map estimation methods discussed in this tutorial.

Linear parametric RME
- Input (besides measurements): transmitter locations $x_1^{\mathrm{TX}}, \ldots, x_S^{\mathrm{TX}}$; path loss law, e.g., $\psi_s(x) := 1/\|x - x_s^{\mathrm{TX}}\|^2$.
- Strengths: simplicity; closed form; accuracy in LOS conditions; can easily accommodate knowledge of transmit antenna patterns.
- Limitations: inaccurate in non-LOS conditions; requires transmitter locations.

Kernel-based learning
- Input: reproducing kernel $\kappa(x, x')$; loss $\mathcal{L}$; regularization parameter $\lambda$.
- Strengths: high flexibility; does not require transmitter locations.
- Limitations: sensitive to the choice of the kernel; depending on $\mathcal{L}$, may require a numerical solver; $\lambda$ must be tuned, e.g., via cross validation.

Kriging
- Input: mean $\mu_p(x)$ and covariance $\mathrm{Cov}[p(x), p(x')]$ of $p$; measurement noise variance $\sigma_z^2$.
- Strengths: LMMSE optimality; closed form; naturally suited to the customary log-normal shadowing model; estimation error that can be quantified.
- Limitations: accurate covariance structure may be hard to obtain; requires user locations.

Sparsity-based methods
- Input: discrete grid; regularization parameter $\lambda$.
- Strengths: efficient algorithms available for obtaining a solution; recovered sparse solution readily interpretable.
- Limitations: prior knowledge on propagation characteristics needed; errors due to grid mismatch.

Matrix completion
- Input: regular grid; regularization parameter $\lambda$.
- Strengths: agnostic to propagation characteristics; spatial correlation structure exploited.
- Limitations: critical low-rank condition; sufficient number of measurements required for stable interpolation.

Dictionary learning
- Input: dictionary size $Q$; regularization parameters $\lambda_s$ and $\lambda_L$.
- Strengths: powerful union-of-subspace prior for spatial patterns; can accommodate rapid temporal dynamics.
- Limitations: nonconvex optimization; hyperparameter tuning necessary.

Deep learning
- Input: terrain maps; vegetation maps; building height maps; network architecture; training parameters; others.
- Strengths: can learn propagation patterns from a dataset; more accurate than other methods if sufficient data are available [18].
- Limitations: large amount of data required; training that is computationally intensive.

LMMSE: linear minimum mean-square error; LOS: line of sight.



So far, it was assumed that propagation takes place in free space. If this is not the case, then the basis functions $\psi_s(x) = 1/\|x - x_s^{\mathrm{TX}}\|^2$ may not yield a satisfactory fit. Although one can, in principle, adopt other families of basis functions, such as those determined by the well-known Okumura–Hata model, the flexibility of such an approach is rather limited. In addition, the location of the sources is required, which may not be a realistic assumption in some applications. These observations suggest generalizing (1) to

$$p(x) = \alpha_1 \tilde{\psi}_1(x) + \cdots + \alpha_B \tilde{\psi}_B(x) \tag{3}$$

where $\tilde{\psi}_b(x)$ can take an arbitrary form and need not even be linked to any particular transmitter. For example, in the case where a map needs to be constructed on a line, $\{\tilde{\psi}_b(x)\}_b$ could form a polynomial basis by setting $\tilde{\psi}_b(x) = x^{b-1}$. The coefficients $\{\alpha_b\}$ can again be found by LS estimation. However, despite the appealing simplicity of this approach, the quality of the estimates is often poor. As illustrated by Figure 2 for the same setup as in Figure 1, this kind of regression method may be sensitive to the choice of the basis functions.

FIGURE 2. An example of map estimation by fitting a polynomial of degree 13 via LS. The estimate is clearly unsatisfactory despite the fact that the estimate accurately fits most of the measurements.

Kernel-based learning
The main challenge faced by the parametric methods described in the previous section lies in the difficulty in selecting suitable basis functions. This difficulty is further exacerbated in higher dimensions, such as when $\mathcal{X} = \mathbb{R}^2$ or $\mathbb{R}^3$. Kernel-based learning can sidestep this issue while enjoying simplicity, universality, and good performance [19].
Upon postulating a family of functions $\mathcal{G}$, the goal is to select, based on the data $\{(x_n, m_n)\}_{n=1}^N$, a function $\hat{p}$ in $\mathcal{G}$ that satisfies $\hat{p}(x) \approx p(x)$ $\forall x$. In kernel-based learning, $\mathcal{G}$ is a special class of functions termed a reproducing-kernel Hilbert space (RKHS), given by

$$\mathcal{G} := \left\{ g : g(x) = \sum_{i=1}^{\infty} \alpha_i \kappa(x, x'_i),\ x'_i \in \mathcal{X},\ \alpha_i \in \mathbb{R}\ \forall i \right\}. \tag{4}$$

Here, $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a reproducing kernel [19, Ch. 2], which is a function that is 1) symmetric, i.e., $\kappa(x, x') = \kappa(x', x)$ $\forall x', x$, and 2) positive definite, meaning that the matrix $\bar{K}$ with entries $(\bar{K})_{i,j} = \kappa(x_i, x_j)$ is positive definite for any set of points $\{x_1, \ldots, x_N\}$. A common choice is the so-called Gaussian radial basis function $\kappa(x, x') := \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$, where $\sigma > 0$ is a prescribed parameter. Seen as a function of $x$, $\kappa(x, x'_i)$ is in this case a bell-shaped surface centered at $x'_i$. Thus, it can be observed from (4) that a function in $\mathcal{G}$ is a superposition of (a possibly infinite number of) Gaussian bells with different centers and amplitudes, as illustrated in Figure 3.

FIGURE 3. An example of a function in an RKHS obtained with the expansion in (4) with only five terms.

A typical approach for finding a suitable estimate $\hat{p}$ in $\mathcal{G}$ is to solve

$$\hat{p} = \arg\min_{g \in \mathcal{G}} \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(m_n, g(x_n)) + \lambda \|g\|_{\mathcal{G}}^2 \tag{5}$$

where $\lambda > 0$ is a predetermined regularization parameter, and $\mathcal{L}$ is a loss function quantifying the deviation between the observations $\{m_n\}_{n=1}^N$ and the predictions $\{g(x_n)\}_{n=1}^N$ produced by a candidate $g$. If the square loss $\mathcal{L}(m_n, g(x_n)) = (m_n - g(x_n))^2$ is adopted, (5) becomes kernel ridge regression (KRR) [19, Ch. 4] (Figures 4 and 5).
The RKHS norm of $g(x) = \sum_{i=1}^{\infty} \alpha_i \kappa(x, x'_i)$ is given by

$$\|g\|_{\mathcal{G}} := \sqrt{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \alpha_i \alpha_j \kappa(x'_i, x'_j)}. \tag{6}$$

The term $\|g\|_{\mathcal{G}}^2$ in (5) can be replaced by other increasing functions of $\|g\|_{\mathcal{G}}$, but this explanation considers just the special case of $\|g\|_{\mathcal{G}}^2$ for simplicity.
To understand the role of the regularization term $\lambda \|g\|_{\mathcal{G}}^2$ in (5), first note that $\mathcal{L}$ is typically designed so that its minimum is attained when $g(x_n) = m_n$. Thus, in the absence of the regularization term, owing to the infinite degrees of freedom of $g$ [compare with (4)], the solution $\hat{p}$ to (5) would achieve a perfect fit for all measurements. However, such a $\hat{p}$ would typically be highly irregular since it would fit even the noise component of the measurements and, thus, likely differ significantly from $p$ at the locations where no measurements were taken. The regularization term helps avoid such overfitting by promoting smoothness in $\hat{p}$. The reason is that, since $\kappa$ is positive definite, $\|g\|_{\mathcal{G}}^2$ penalizes large values of $\{\alpha_i\}$, which tend to occur in overfitted solutions. Parameter $\lambda$ is adjusted to achieve the "sweet spot" between data fitting and regularization.
To solve (5), one could initially think of substituting the expansion (4) into (5) and optimizing over the infinitely many coefficients $\{\alpha_i\}$ and centroids $\{x'_i\}$. However, this approach is obviously intractable. Instead, the so-called representer theorem can be invoked [19, Th. 4.2], which states that the solution to (5) must be of the form

$$\hat{p}(x) = \sum_{n=1}^{N} \alpha_n \kappa(x, x_n) \tag{7}$$

for some $\{\alpha_n\}_{n=1}^N$. Observe that the centroids in (7) are precisely the measurement locations. This effectively reduces an optimization problem with infinitely many variables to a problem with just the $N$ variables $\alpha_1, \ldots, \alpha_N$. For example, if one adopts the square loss, substituting (7) into (5) yields

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{N} \|m - K\alpha\|^2 + \lambda\, \alpha^{\top} K \alpha \tag{8}$$

where (with some abuse of notation) $\alpha := [\alpha_1, \ldots, \alpha_N]^{\top}$, and $K$ is an $N$-by-$N$ matrix with $(K)_{i,j} = \kappa(x_i, x_j)$. It should be noted that now the number of parameters to be determined depends on the number of measurements $N$, which is why these kinds of methods are called nonparametric. Problem (8) admits the closed-form solution

$$\hat{\alpha} = (K + \lambda N I_N)^{-1} m \tag{9}$$

from which $\hat{p}$ can be obtained via (7). Figures 4 and 5 show the KRR-based map estimates in the same setup as in Figures 1 and 2. It can be seen that, as the number of measurements increases, the estimated map becomes closer to the true map.

FIGURE 4. An example of a KRR estimate. As expected, the quality of the fit is higher in regions with higher measurement density.

FIGURE 5. An example of a KRR estimate with more measurements than in Figure 4. The fit is considerably better.
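A minimal Python implementation of the KRR estimator in (7)–(9) is sketched below on synthetic 1D data; the kernel width, regularization parameter, and data-generating map are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kernel(a, b, sigma=10.0):
    """Gaussian RBF kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), 1D inputs."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * sigma ** 2))

# Synthetic 1D measurements m_n = p(x_n) + z_n (toy map, assumed here).
x_n = rng.uniform(0.0, 100.0, size=40)
p_true = lambda x: 500.0 / (1.0 + (x - 40.0) ** 2 / 100.0)
m = p_true(x_n) + 5.0 * rng.standard_normal(x_n.size)

# KRR coefficients per (9): alpha_hat = (K + lambda N I)^{-1} m.
lam, N = 1e-3, x_n.size
K = gaussian_kernel(x_n, x_n)
alpha_hat = np.linalg.solve(K + lam * N * np.eye(N), m)

# Map estimate per (7): p_hat(x) = sum_n alpha_n kappa(x, x_n).
x_grid = np.linspace(0.0, 100.0, 501)
p_hat = gaussian_kernel(x_grid, x_n) @ alpha_hat
```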


It is worth mentioning that RME based on kernel methods is best suited for scenarios where no prior knowledge of the propagation environment is available. When prior information, such as the transmitter locations or the path loss exponent, is indeed available, it is also possible to combine the flexibility of nonparametric kernel methods with the ability of parametric methods to capture prior information by means of appropriate basis functions. To this end, one can postulate that $p$ can be represented as the sum of a function in the form of (3) and a function in an RKHS [20]. Such an approach also generalizes the so-called thin-plate spline regression, which has well-documented merits in RME [4], [9].
Another limitation of kernel-based methods is the need for choosing the kernel (including its parameters), which may affect the estimation performance significantly. This difficulty may be alleviated through multikernel learning, where a dictionary of kernels can be specified and a suitably designed algorithm uses the measurements to construct a kernel by combining the kernels in the dictionary; see the references in [18].

Kriging
RME can also be formulated in a statistical framework, where $p(x)$ is treated as a random process. A popular approach is kriging, which is a linear spatial interpolator based on the linear minimum mean-square error (LMMSE) criterion [1], [21], [22]. In simple kriging, the mean and the covariance of $p(x)$ are assumed to be known. That is, $\mu_p(x) := \mathbb{E}[p(x)]$ and $\mathrm{Cov}[p(x), p(x')]$ are given for all $x$ and $x'$. How to obtain these functions is discussed later.
Under the measurement model $m_n = p(x_n) + z_n$, $n = 1, 2, \ldots, N$, assume that $z_n$ is zero mean with variance $\sigma_z^2$ and uncorrelated with $z_{n'}$ for all $n' \neq n$ and with $p(x)$ for all $x$. Thus, the mean and covariance of the measurements are, respectively, $\mathbb{E}[m_n] = \mu_p(x_n)$ and $\mathrm{Cov}[m_n, m_{n'}] = \mathrm{Cov}[p(x_n), p(x_{n'})] + \sigma_z^2 \delta_{n,n'}$, where $\delta_{n,n'}$ equals one if $n = n'$ and zero otherwise. It can also be verified that $\mathrm{Cov}[p(x), m_n] = \mathrm{Cov}[p(x), p(x_n)]$.
Then, it can be shown that the LMMSE estimator of $p(x)$ based on the measurements $m := [m_1, \ldots, m_N]^{\top}$ is given by

$$\hat{p}(x) = \mu_p(x) + \mathrm{Cov}[p(x), m]\, \mathrm{Cov}[m, m]^{-1} (m - \mathbb{E}[m]) \tag{10}$$

where $\mathrm{Cov}[m, m]$ is the $N \times N$ matrix whose $(n, n')$th entry is $\mathrm{Cov}[m_n, m_{n'}]$, and $\mathrm{Cov}[p(x), m]$ is the $1 \times N$ vector with the $n$th entry equal to $\mathrm{Cov}[p(x), m_n]$.
It is worth comparing (10) with (7) and (9). It can be easily seen that, except for the mean terms in (10), the estimators provided by (10) and (7) coincide if one sets $\kappa(x, x') = \mathrm{Cov}[p(x), p(x')]$ and $\lambda$ is adjusted properly. This is a manifestation of the well-known fact that a reproducing kernel can be thought of as a generalization of covariance. As a result, some of the practical issues and corresponding mitigation strategies for kernel-based learning apply to kriging as well.
To obtain the mean $\mu_p(x)$ and the covariance $\mathrm{Cov}[p(x), p(x')]$ of the map $p(x)$ to be estimated, one can rely on historic measurement data. Given the covariance function, universal kriging also provides a framework to estimate $\mu_p(x)$ as a part of the kriging estimator.
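The following minimal Python sketch implements the simple kriging estimator (10), taking the mean and covariance functions as given; the exponentially decaying covariance model and all numerical values are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed prior model: constant mean and an exponentially decaying spatial
# covariance with a 20-m decorrelation distance (illustrative values).
mu_p = lambda x: 50.0
cov_p = lambda a, b: 36.0 * np.exp(-np.abs(a[:, None] - b[None, :]) / 20.0)

# Synthetic measurements m_n = p(x_n) + z_n on a 1D region.
sigma_z2 = 1.0
x_n = rng.uniform(0.0, 100.0, size=30)
m = (mu_p(x_n)
     + rng.multivariate_normal(np.zeros(30), cov_p(x_n, x_n))
     + np.sqrt(sigma_z2) * rng.standard_normal(30))

# Simple kriging per (10): p_hat(x) = mu_p(x) + C_xm C_mm^{-1} (m - E[m]).
C_mm = cov_p(x_n, x_n) + sigma_z2 * np.eye(x_n.size)   # Cov[m, m]
x_grid = np.linspace(0.0, 100.0, 501)
C_xm = cov_p(x_grid, x_n)                              # Cov[p(x), m], one row per x
p_hat = mu_p(x_grid) + C_xm @ np.linalg.solve(C_mm, m - mu_p(x_n))
```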


Kriging

RME can also be formulated in a statistical framework, where $p(x)$ is treated as a random process. A popular approach is kriging, which is a linear spatial interpolator based on the linear minimum mean-square error (LMMSE) criterion [1], [21], [22]. In simple kriging, the mean and the covariance of $p(x)$ are assumed to be known. That is, $\mu_p(x) := E[p(x)]$ and $\mathrm{Cov}[p(x), p(x')]$ are given for all $x$ and $x'$. How to obtain these functions is discussed later.

Under the measurement model $m_n = p(x_n) + z_n$, $n = 1, 2, \ldots, N$, assume that $z_n$ is zero mean with variance $\sigma_z^2$ and uncorrelated with $z_{n'}$ for all $n' \neq n$ and with $p(x)$ for all $x$. Thus, the mean and covariance of the measurements are, respectively, $E[m_n] = \mu_p(x_n)$ and $\mathrm{Cov}[m_n, m_{n'}] = \mathrm{Cov}[p(x_n), p(x_{n'})] + \sigma_z^2 \delta_{n,n'}$, where $\delta_{n,n'}$ equals one if $n = n'$ and zero otherwise. It can also be verified that $\mathrm{Cov}[p(x), m_n] = \mathrm{Cov}[p(x), p(x_n)]$. Then, it can be shown that the LMMSE estimator of $p(x)$ based on the measurements $\mathbf{m} := [m_1, \ldots, m_N]^{\top}$ is given by

$\hat{p}(x) = \mu_p(x) + \mathrm{Cov}[p(x), \mathbf{m}]\, \mathrm{Cov}[\mathbf{m}, \mathbf{m}]^{-1} (\mathbf{m} - E[\mathbf{m}])$   (10)

where $\mathrm{Cov}[\mathbf{m}, \mathbf{m}]$ is the $N \times N$ matrix whose $(n, n')$th entry is $\mathrm{Cov}[m_n, m_{n'}]$, and $\mathrm{Cov}[p(x), \mathbf{m}]$ is the $1 \times N$ vector with the $n$th entry equal to $\mathrm{Cov}[p(x), m_n]$.

It is worth comparing (10) with (7) and (9). It can be easily seen that, except for the mean terms in (10), the estimators provided by (10) and (7) coincide if one sets $\kappa(x, x') = \mathrm{Cov}[p(x), p(x')]$ and $\mu$ is adjusted properly. This is a manifestation of the well-known fact that a reproducing kernel can be thought of as a generalization of covariance. As a result, some of the practical issues and corresponding mitigation strategies for kernel-based learning apply to kriging as well.

To obtain the mean $\mu_p(x)$ and the covariance $\mathrm{Cov}[p(x), p(x')]$ of the map $p(x)$ to be estimated, one can rely on historic measurement data. Given the covariance function, universal kriging also provides a framework to estimate $\mu_p(x)$ as a part of the kriging estimator.

Next, a simple example with a single transmitter at location $x^{\mathrm{TX}}$ transmitting with power $p^{\mathrm{TX}}$ will be used to illustrate how the mean and covariance can be derived from common propagation models; a more sophisticated example involving the idea of universal kriging and incorporating temporal variations as well will be presented in the "Nontomographic Approaches" section. To this end, note that the received power in the logarithmic scale can be written as $p_{\mathrm{dB}}(x) = p^{\mathrm{TX}}_{\mathrm{dB}} + h_{\mathrm{dB}}(x^{\mathrm{TX}}, x)$, where $p^{\mathrm{TX}}_{\mathrm{dB}}$ and $h_{\mathrm{dB}}(x^{\mathrm{TX}}, x)$ are expressed in decibels. A common decomposition for the latter is $h_{\mathrm{dB}}(x^{\mathrm{TX}}, x) = h^{\mathrm{PL}}(x) - a^{\mathrm{SF}}(x) - a^{\mathrm{FF}}(x)$, where $h^{\mathrm{PL}}(x)$ is the path loss, $a^{\mathrm{SF}}(x)$ is the attenuation due to shadow fading, and $a^{\mathrm{FF}}(x)$ is the attenuation due to fast fading. The dependence on $x^{\mathrm{TX}}$ and the subscript "dB" on the right-hand side have been omitted for brevity. Recall that shadow fading is produced by obstructions in the line of sight between the transmitter and the receiver, whereas fast fading is due to the constructive and destructive interference between the different multipath components arriving at the receiver.

With this decomposition, it is common to model $h^{\mathrm{PL}}(x)$ as a deterministic function of $x$. Furthermore, $a^{\mathrm{SF}}(x)$ and $a^{\mathrm{FF}}(x')$ can be assumed to be uncorrelated for all $x$ and $x'$ and to have means $\mu_{\mathrm{SF}}$ and $\mu_{\mathrm{FF}}$, respectively. The spatial structure of $a^{\mathrm{SF}}(x)$ is often captured by a simple correlation model, such as the Gudmundson model [23], which prescribes that $\mathrm{Cov}[a^{\mathrm{SF}}(x), a^{\mathrm{SF}}(x')] = \sigma^2_{\mathrm{SF}}\, 2^{-\|x - x'\|/d_{\mathrm{SF}}}$. Here, $\sigma^2_{\mathrm{SF}}$ is a constant, and $d_{\mathrm{SF}}$ is the distance at which the correlation decays by 50%. On the other hand, due to the rapid spatial variability of $a^{\mathrm{FF}}(x)$, it is reasonable to set $\mathrm{Cov}[a^{\mathrm{FF}}(x), a^{\mathrm{FF}}(x')] = \sigma^2_{\mathrm{FF}}\, \delta_{x, x'}$. Then, we have $\mu_{p,\mathrm{dB}}(x) = p^{\mathrm{TX}}_{\mathrm{dB}} + h^{\mathrm{PL}}(x) - \mu_{\mathrm{SF}} - \mu_{\mathrm{FF}}$ and $\mathrm{Cov}[p_{\mathrm{dB}}(x), p_{\mathrm{dB}}(x')] = \sigma^2_{\mathrm{SF}}\, 2^{-\|x - x'\|/d_{\mathrm{SF}}} + \sigma^2_{\mathrm{FF}}\, \delta_{x, x'}$.
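As an illustration, the following sketch implements the simple kriging estimator (10) in Python under the Gudmundson covariance model above, with a log-distance path loss as the known mean. All parameter values are arbitrary choices for the example, not prescriptions from the article.

```python
import numpy as np

def gudmundson_cov(X, Y, sigma_sf=4.0, d_sf=20.0):
    """Shadow fading covariance (dB): sigma_sf^2 * 2^{-|x - x'| / d_sf}."""
    d = np.abs(X[:, None] - Y[None, :])
    return sigma_sf**2 * 2.0 ** (-d / d_sf)

def simple_kriging(x_eval, x_meas, m, mu_p, sigma_z=1.0):
    """LMMSE estimator (10) with known mean function mu_p and known covariance."""
    C_mm = gudmundson_cov(x_meas, x_meas) + sigma_z**2 * np.eye(len(x_meas))
    C_pm = gudmundson_cov(x_eval, x_meas)            # Cov[p(x), m]
    return mu_p(x_eval) + C_pm @ np.linalg.solve(C_mm, m - mu_p(x_meas))

# Toy 1D example with an assumed log-distance path loss as the mean.
p_tx_db, x_tx = 30.0, 0.0
mu_p = lambda x: p_tx_db - 20.0 * np.log10(np.maximum(np.abs(x - x_tx), 1.0))
rng = np.random.default_rng(1)
x_meas = rng.uniform(1, 100, size=25)
m = mu_p(x_meas) + rng.multivariate_normal(np.zeros(25),
                                           gudmundson_cov(x_meas, x_meas))
p_hat_db = simple_kriging(np.linspace(1, 100, 200), x_meas, m, mu_p)
```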
Leveraging sparsity

In many practical RME problems, estimation performance can be significantly improved by incorporating prior information. The sparsity prior has played a critical role in compressive sensing, a framework in which RME problems can often be formulated. Moreover, depending on the choice of the basis functions, the sparsity prior can be physically interpreted in terms of the spatial, temporal, and spectral scarceness of the RF energy distribution [2], [9].

Consider once more the linear parametric RME model (1) but, rather than assuming that the number $S$ and locations $\{x^{\mathrm{TX}}_s\}$ of the transmitters are known, simply discretize the map area using $N_g$ grid points $\{x^{\mathrm{grid}}_{n_g}\}_{n_g=1}^{N_g} \subset \mathcal{X}$ representing the possible locations of the transmitters. Then, upon defining $\tilde{\boldsymbol{\alpha}} := [\tilde{\alpha}_1, \ldots, \tilde{\alpha}_{N_g}]^{\top}$ and $\tilde{\boldsymbol{\Psi}} \in \mathbb{R}^{N \times N_g}$ with $(\tilde{\boldsymbol{\Psi}})_{n, n_g} = \psi_{n_g}(x_n) := 1/\|x_n - x^{\mathrm{grid}}_{n_g}\|^2$ for $n = 1, \ldots, N$ and $n_g = 1, \ldots, N_g$, one has the model $\mathbf{m} = \tilde{\boldsymbol{\Psi}}\tilde{\boldsymbol{\alpha}} + \mathbf{z}$. In practical scenarios, it is expected that only a small subset of the grid points are actually occupied by transmitters; that is, $S \ll N_g$. Thus, one can impose the sparsity prior on $\tilde{\boldsymbol{\alpha}}$. For example, a least absolute shrinkage and selection operator (LASSO) problem can be formulated as $\hat{\tilde{\boldsymbol{\alpha}}} := \arg\min_{\tilde{\boldsymbol{\alpha}}} \|\mathbf{m} - \tilde{\boldsymbol{\Psi}}\tilde{\boldsymbol{\alpha}}\|_2^2 + \mu \|\tilde{\boldsymbol{\alpha}}\|_1$, where $\mu > 0$, and the term $\|\tilde{\boldsymbol{\alpha}}\|_1 := \sum_{n_g=1}^{N_g} |\tilde{\alpha}_{n_g}|$ is known to promote sparsity in $\tilde{\boldsymbol{\alpha}}$. The nonzero entries of the obtained $\hat{\tilde{\boldsymbol{\alpha}}}$ reveal the (grid-based) locations $\{x^{\mathrm{TX}}_s\}$ and the number $S$ of the transmitters. Then, one can reconstruct the desired power map $p(x)$ using (1).

As in the linear parametric RME approach, the adopted basis functions $\psi_{n_g}(x) = 1/\|x - x^{\mathrm{grid}}_{n_g}\|^2$ may not accurately capture the actual propagation characteristics. Possible remedies for this issue include sparse total LS [8], kernel-based learning [9], and sparse Bayesian learning techniques [24]. In particular, the basis mismatch issue due to the grid-based discretization of space can be mitigated in the atomic norm minimization framework.
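A minimal sketch of this grid-based sparse formulation follows, using scikit-learn's LASSO solver (an assumed dependency; any $\ell_1$ solver would do). The grid size, the path-loss-style basis, and the regularization weight are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Grid of candidate transmitter locations over a 100 m x 100 m area.
g = np.linspace(0, 100, 20)
grid = np.array([(gx, gy) for gx in g for gy in g])          # N_g x 2

def basis_matrix(sensors, grid):
    """(Psi)_{n, n_g} = 1 / ||x_n - x_{n_g}^grid||^2 (with a small floor)."""
    d2 = ((sensors[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    return 1.0 / np.maximum(d2, 1.0)

rng = np.random.default_rng(2)
sensors = rng.uniform(0, 100, size=(60, 2))                  # N sensor locations
Psi = basis_matrix(sensors, grid)

# Synthetic measurements from S = 2 active grid points plus noise.
alpha_true = np.zeros(len(grid)); alpha_true[[57, 301]] = [200.0, 150.0]
m = Psi @ alpha_true + 0.01 * rng.standard_normal(len(sensors))

# LASSO: min ||m - Psi alpha||_2^2 + mu ||alpha||_1 (positive since powers >= 0).
lasso = Lasso(alpha=1e-3, positive=True, max_iter=50_000).fit(Psi, m)
support = np.flatnonzero(lasso.coef_ > 1e-6)   # estimated transmitter grid indices
```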
Matrix completion

Another useful framework for RME is low-rank matrix completion. For instance, consider building a power map over a rectangular area $\mathcal{X} \subset \mathbb{R}^2$. By discretizing $\mathcal{X}$ using a regular grid $\{x^{\mathrm{grid}}_{(i,j)} : i = 1, \ldots, I,\ j = 1, \ldots, J\}$, one can obtain a power map matrix $\mathbf{P} \in \mathbb{R}^{I \times J}$, where $(\mathbf{P})_{i,j} := p(x^{\mathrm{grid}}_{(i,j)})$. Of course, only a small subset of the entries will be actually observed by the sensors. However, when the grid is dense enough compared to the spatial variability of the map, adjacent entries of $\mathbf{P}$ will be similar, which will, in turn, manifest itself as an approximate rank deficiency of $\mathbf{P}$; that is, $\mathrm{rank}(\mathbf{P}) \ll \min\{I, J\}$. Matrix completion, thus, tries to estimate the unobserved entries of $\mathbf{P}$ under a low-rank prior. Since directly promoting low rank gives rise to nonconvex problems, tractable formulations are typically pursued by penalizing the nuclear norm of the estimate, which is the sum of its singular values.



Denote the set of indices of the observed entries as $\mathcal{O} \subset \{1, \ldots, I\} \times \{1, \ldots, J\}$ and the nuclear norm of $\mathbf{P}$ as $\|\mathbf{P}\|_*$. Also, let $\mathbf{M}$ be the matrix whose $(i, j)$th element equals the sensor measurement at $x^{\mathrm{grid}}_{(i,j)}$ if $(i, j) \in \mathcal{O}$ and zero otherwise. A matrix completion problem for the power map can be posed as

$\min_{\mathbf{P}} \ \frac{1}{2} \sum_{(i,j) \in \mathcal{O}} \left[ (\mathbf{P})_{i,j} - (\mathbf{M})_{i,j} \right]^2 + \mu \|\mathbf{P}\|_*.$   (11)

With a sufficient number of observed entries, which depends on the rank and the incoherence of $\mathbf{P}$, the desired map can be reconstructed reliably.

When $\mathcal{X}$ grows large, the rank of $\mathbf{P}$ may increase, as the power distribution may become more diverse. In this case, local matrix completion on submatrices of $\mathbf{P}$ may be a viable approach [25]. The matrix completion idea can also be extended to tensors when the maps in a 3D space are desired [26] or when the time and frequency domains are considered together with space [27].
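Problem (11) can be solved, for example, by proximal gradient descent, whose proximal step is singular value soft thresholding. The following sketch is one such solver; the step size, penalty weight, iteration count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def svt(Z, tau):
    """Proximal operator of tau * nuclear norm: soft threshold singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete_map(M, mask, mu=1.0, step=1.0, n_iter=300):
    """Proximal gradient for (11): data fit on observed entries + mu * ||P||_*."""
    P = M.copy()
    for _ in range(n_iter):
        grad = mask * (P - M)              # gradient of the quadratic term
        P = svt(P - step * grad, step * mu)
    return P

# Toy example: a rank-2 "power map" observed on 30% of a 40 x 40 grid.
rng = np.random.default_rng(3)
P_true = rng.uniform(0, 1, (40, 2)) @ rng.uniform(0, 1, (2, 40))
mask = (rng.uniform(size=P_true.shape) < 0.3).astype(float)
M = mask * P_true
P_hat = complete_map(M, mask, mu=0.05)
```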
Dictionary learning

When it is desired to capture the temporal variations of the power map, e.g., to exploit unused spectral resources over both time and space, it is useful to learn a library of power maps from which the suitable one can be chosen to explain the power distribution at a given time. Dictionary learning is an unsupervised learning method that seeks a possibly overcomplete basis, termed a dictionary, such that the data vectors can be expressed as linear combinations of a small number of vectors in the dictionary.

Denote the power measurements of the $N$ sensors at time $t$ as $\mathbf{m}(t) := [m_1(t), \ldots, m_N(t)]^{\top}$ for $t = 1, \ldots, T$. Dictionary learning postulates that $\mathbf{m}(t)$ can be represented using a dictionary $\mathbf{D} \in \mathbb{R}^{N \times Q}$ as $\mathbf{m}(t) \approx \mathbf{D}\mathbf{s}(t)$, where $\mathbf{s}(t) \in \mathbb{R}^{Q}$ is a sparse vector of coefficients for the measurements at time $t$. The columns of $\mathbf{D}$ are called the atoms. Collecting the data samples into a matrix $\mathbf{M} := [\mathbf{m}(1), \ldots, \mathbf{m}(T)] \in \mathbb{R}_+^{N \times T}$, one can appreciate that finding such a dictionary can be viewed as a matrix factorization task since $\mathbf{M} \approx \mathbf{D}\mathbf{S}$, where $\mathbf{S} := [\mathbf{s}(1), \ldots, \mathbf{s}(T)]$ is a sparse matrix. There are various optimization formulations to learn $\mathbf{D}$ from $\mathbf{M}$ [28].

In the present context of power map estimation, consider the case where the sensors do not report their measurements every time due to, e.g., energy-saving sleep modes or congested signaling channels. Thus, the network controller must apply an appropriate interpolation technique to estimate the missing observations. A helpful piece of side information is the topology of the network of sensors, which is typically maintained for various network control tasks, such as routing. To leverage this topology information, let $\mathbf{A} \in \{0, 1\}^{N \times N}$ be the adjacency matrix of the network topology; i.e., the $(n, n')$th entry $a_{n,n'}$ of $\mathbf{A}$ is equal to one if nodes $n$ and $n'$ can communicate directly with each other and zero otherwise. The Laplacian matrix $\mathbf{L}$ is defined as $\mathbf{L} := \mathrm{diag}\{\mathbf{A}\mathbf{1}\} - \mathbf{A}$, where $\mathbf{1}$ is the all-one vector. As seen next, this matrix can be used to promote spatial smoothness in the sense that the power estimates at adjacent sensors are similar.

For training, at each time $t$, a subset $\mathcal{N}^{\mathrm{obs}}(t) \subset \mathcal{N} := \{1, \ldots, N\}$ of sensors acquires power measurements, which are stacked in the vector $\mathbf{m}^{\mathrm{obs}}(t) \in \mathbb{R}_+^{|\mathcal{N}^{\mathrm{obs}}(t)|}$. Also, let $\mathbf{O}(t)$ denote the matrix that contains the $n$th row of the $N \times N$ identity matrix if and only if $n \in \mathcal{N}^{\mathrm{obs}}(t)$. Then, upon defining

$f(\mathbf{s}, \mathbf{D}; \mathbf{m}^{\mathrm{obs}}(t), \mathbf{O}(t)) := \frac{1}{2} \| \mathbf{m}^{\mathrm{obs}}(t) - \mathbf{O}(t)\mathbf{D}\mathbf{s} \|_2^2 + \lambda_s \|\mathbf{s}\|_1 + \frac{\lambda_L}{2}\, \mathbf{s}^{\top} \mathbf{D}^{\top} \mathbf{L} \mathbf{D} \mathbf{s}$   (12)

the dictionary can be learned via

$\hat{\mathbf{D}} := \arg\min_{\mathbf{D} \in \mathcal{D},\, \{\mathbf{s}(t)\}} \ \sum_{t=1}^{T} f(\mathbf{s}(t), \mathbf{D}; \mathbf{m}^{\mathrm{obs}}(t), \mathbf{O}(t))$   (13)

where $\mathcal{D} := \{[\mathbf{d}_1, \ldots, \mathbf{d}_Q] \in \mathbb{R}^{N \times Q} : \|\mathbf{d}_q\|_2^2 \leq 1,\ q = 1, \ldots, Q\}$. The first term in (12) promotes the fitness of the reconstruction to the training datum in an LS sense; the second term, with an adjustable weight $\lambda_s > 0$, is an $\ell_1$-norm-based regularizer encouraging sparsity in $\mathbf{s}$; and the third term, with the weight $\lambda_L \geq 0$, captures the prior information that the power levels at the neighboring sensor nodes should be similar since it holds that $\mathbf{v}^{\top} \mathbf{L} \mathbf{v} = (1/2) \sum_{n=1}^{N} \sum_{n'=1}^{N} a_{nn'} (v_n - v_{n'})^2$ for any $\mathbf{v} := [v_1, \ldots, v_N]^{\top} \in \mathbb{R}^N$. Problem (13) can be solved efficiently via a block coordinate descent algorithm [29].

In the operational phase, once the dictionary $\hat{\mathbf{D}}$ is obtained from (13), given a (new) set of measurements $\bar{\mathbf{m}}^{\mathrm{obs}}$ and the corresponding observation matrix $\bar{\mathbf{O}}$ (corresponding to the observation set $\bar{\mathcal{N}}^{\mathrm{obs}}$), one first finds the sparse coefficients by solving $\bar{\mathbf{s}} := \arg\min_{\mathbf{s}} f(\mathbf{s}, \hat{\mathbf{D}}; \bar{\mathbf{m}}^{\mathrm{obs}}, \bar{\mathbf{O}})$. Then, the missing power levels for sensors $n \in \bar{\mathcal{N}}^{\mathrm{miss}} := \mathcal{N} \setminus \bar{\mathcal{N}}^{\mathrm{obs}}$ can be obtained by first reconstructing the whole $\hat{\bar{\mathbf{m}}} = \hat{\mathbf{D}}\bar{\mathbf{s}}$ and extracting the entries $\{\hat{\bar{m}}_n\}_{n \in \bar{\mathcal{N}}^{\mathrm{miss}}}$. A practical challenge is to implement the algorithm for online and distributed operation to handle large-scale real-time computation [29], [30]. Additionally, tuning the hyperparameters, such as the dictionary size and the regularization parameters, may require cross validation based on historic measurements.
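The sketch below illustrates the operational phase only: given a dictionary (random here, standing in for one learned via (13); the full block coordinate descent of [29] is not reproduced), the sparse code of a partially observed snapshot is found by ISTA applied to (12), and the missing sensor powers are read off from $\hat{\mathbf{D}}\bar{\mathbf{s}}$. All sizes and weights are illustrative.

```python
import numpy as np

def ista_sparse_code(m_obs, O, D, L, lam_s=0.1, lam_L=0.1, n_iter=500):
    """Minimize (12) over s by ISTA: gradient step on the smooth terms,
    then soft thresholding for the l1 term."""
    OD = O @ D
    Q = lam_L * D.T @ L @ D
    step = 1.0 / np.linalg.norm(OD.T @ OD + Q, 2)   # 1 / Lipschitz constant
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = OD.T @ (OD @ s - m_obs) + Q @ s
        z = s - step * grad
        s = np.sign(z) * np.maximum(np.abs(z) - step * lam_s, 0.0)
    return s

# Toy setup: N = 8 sensors on a line graph, dictionary with Q = 12 atoms.
rng = np.random.default_rng(4)
N, Q = 8, 12
A = np.diag(np.ones(N - 1), 1); A = A + A.T          # line-graph adjacency
L = np.diag(A.sum(1)) - A                            # Laplacian matrix
D_hat = rng.uniform(size=(N, Q)); D_hat /= np.linalg.norm(D_hat, axis=0)

obs = np.array([0, 2, 3, 5, 7])                      # sensors that reported
O = np.eye(N)[obs]
m_obs = O @ (D_hat @ np.array([1.0] + [0.0] * (Q - 1)))  # noise-free snapshot
s_bar = ista_sparse_code(m_obs, O, D_hat, L)
m_full = D_hat @ s_bar                               # powers at all sensors
missing = np.setdiff1d(np.arange(N), obs)            # indices recovered
```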
Deep learning

A deep neural network (DNN) is a function $g_{\mathbf{w}}$ that can be expressed as the composition of more elementary functions called layers, which are parameterized by a vector $\mathbf{w}$. Training a DNN involves finding $\mathbf{w}$ so that $g_{\mathbf{w}}$ fits the given dataset. DNNs feature a large learning capacity and can be efficiently trained via stochastic optimization methods. Spatial structures in the data can be readily exploited utilizing convolutional layers, in which case the DNN is called a convolutional neural network (CNN). Next, multiple approaches for using DNNs for signal strength map estimation are described.

Pointwise DNN estimators

The simplest approach is to use a DNN to construct a function where the input is the sensor location, and the output is the



signal strength at that location. This approach was pursued in [10], where the input was encoded using a spherical coordinate system located at the (single) transmitter. Since the dimensionality of the input is small, the network architecture can be kept simple, and the resulting estimator is not affected by the so-called curse of dimensionality [19, Sec. 4.3].

However, such an approach cannot easily capture the spatial structure of the map using CNNs. Furthermore, the DNN needs to be retrained for each specific RF environment. Therefore, it cannot benefit from measurements previously collected in other scenarios, such as different cities.

Local DNN estimators

To alleviate the aforementioned limitations of pointwise DNN estimators, the network input can be replaced with a collection of matrices that capture information about the local environment of the sensor. These matrices, typically stacked as slabs of a tensor, can be thought of as local maps defined over a rectangular grid centered at the sensor. A transmitter (alternatively, a sensor) distance map, for example, is a matrix whose $(i, j)$th entry equals the distance from the $(i, j)$th grid point to the transmitter (sensor) [11], [12]. It is also possible to include a local terrain map that indicates the altitude of the terrain at each grid point. Further kinds of local maps include building indicator maps [18], building height maps [31], [32], or foliage maps [31]. One can also use aerial or satellite images of the surroundings of the sensor as local maps [33]. Figure 6 provides an illustration of this kind of setup.

FIGURE 6. A local DNN estimator, which provides $\hat{p}(x)$ at a single $x$. (Inputs depicted: a terrain map, a building map, a map of distances to the source, and a map of distances from the sensor.)

This input format lends itself to CNN architectures that leverage spatial information in the vicinity of the sensor to predict the received power. To learn across different environments where the transmitters possibly employ different transmit power, one can set the output of the network to be the gain between each transmitter and the sensor and work out the received power afterward. This effectively sets this approach halfway between signal strength and propagation map estimation.

The practical limitation of this approach is that it requires knowledge of the locations (and the transmit powers if one wishes to estimate the gains) of all transmitters, and the measurements must be obtained separately for each transmitter. Furthermore, it only exploits the information in the vicinity of the sensor, but, in practice, obstacles or scatterers far away from the sensor may also affect the received power significantly. In addition, networks designed in this way provide the received power (or channel gain) only at a single location per evaluation (also known as a forward pass). To construct the entire map, the estimator needs to be evaluated repeatedly for each point on a grid, resulting in significant computational complexity.

Global DNN estimators

To accommodate global, rather than local, environment information, one can create a regular grid across the region where the map needs to be constructed and formulate the RME problem as a matrix or tensor completion task [13], [18], [31], [32], [34]. To this end, each measurement is associated with the nearest grid point, and a matrix is constructed with an entry per grid point. If a single measurement is assigned to a grid point, the corresponding entry contains the measurement. If multiple measurements are assigned to a grid point, the corresponding entry may contain their average. Those points with no associated measurements can simply be filled with physically unlikely values [32], [34], [35], or a separate binary mask matrix can be included in the input [18], [36].

Other maps with side information, such as the ones used in local DNN estimators, can also be appended to the input tensor to the network. However, note that, now, these maps must be global in the sense that they capture the entire region of interest.

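To make the global input format concrete, the following PyTorch sketch assembles an input tensor (gridded measurements, a binary observation mask, and a building indicator map) and passes it through a small fully convolutional network. The architecture and channel counts are illustrative, not those of [13], [18], or [32].

```python
import numpy as np
import torch
import torch.nn as nn

I = J = 32                                 # grid over the region of interest
rng = np.random.default_rng(5)

# Channel 0: measurements per grid point (zero where unobserved).
# Channel 1: binary mask flagging observed grid points.
# Channel 2: building indicator map (side information).
meas = np.zeros((I, J)); mask = np.zeros((I, J))
idx = rng.integers(0, I, (50, 2))
meas[idx[:, 0], idx[:, 1]] = rng.uniform(-100, -60, 50)   # dBm values
mask[idx[:, 0], idx[:, 1]] = 1.0
buildings = (rng.uniform(size=(I, J)) < 0.1).astype(np.float32)

x = torch.tensor(np.stack([meas * mask, mask, buildings]),
                 dtype=torch.float32).unsqueeze(0)        # shape (1, 3, I, J)

# A small fully convolutional estimator: 3 input channels -> 1 output map.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
p_hat = net(x)                             # entire map in a single forward pass
print(p_hat.shape)                         # torch.Size([1, 1, 32, 32])
```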



The global input can naturally be processed by CNN architectures. The most common ones are autoencoders [18], [35] and UNets [13], [32]. The motivation for the former is described in "Manifold Structure of Power Maps." A global DNN estimator is illustrated in Figure 7.

Unlike local DNN estimators, a single forward pass of the DNN produces the entire map. Furthermore, using map measurements collected in multiple environments, the architecture can readily learn across different RF scenarios. On the other hand, collecting a sufficiently large dataset to train such a network may be challenging. To alleviate this difficulty, one may resort to data augmentation or incorporate synthetic data from ray-tracing simulators [18]. Another limitation deals with the spatial resolution of the constructed maps. A high-resolution map requires a dense grid, significantly increasing the computational complexity.

Manifold Structure of Power Maps

Autoencoder networks are attuned to situations where the data lies on a low-dimensional manifold embedded in a high-dimensional space. To see that this is the case of radio maps, consider the values of a power map in 2D produced by two sources radiating with a fixed height and power in free space. A dataset can be generated where each map is obtained by placing the sources at random locations on the horizontal plane. Each map is, therefore, uniquely identified by the four scalars corresponding to the locations of the sources. If the maps are defined on a $32 \times 32$ grid, they comprise $32^2 = 1{,}024$ points, which means that these maps lie on a manifold of dimension four embedded in a space of dimension 1,024.

This observation is corroborated in [18] by training an autoencoder on the aforementioned dataset. An autoencoder is the concatenation of an encoder and a decoder. In this case, the encoder takes a $32 \times 32$ map and produces a code vector $\boldsymbol{\lambda}$ of length four. The decoder takes this vector at its input and aims at reconstructing the original $32 \times 32$ map. For a properly trained encoder and decoder, the output of the decoder closely resembles the input of the encoder, which means that the code effectively condenses the information of the map in just four numbers. Each value of the code identifies a point in the manifold. The top panel of Figure S1 shows the output of the decoder when its input equals the average of the codes associated with each map in the dataset. The rest of the panels show the output of the decoder applied to the result of perturbing the entries of this average code, indicated by the index set $\mathcal{S}$, by an amount equal to the standard deviation of that entry across the dataset. This procedure yields different points in the manifold. All panels approximately correspond to maps of the kind composing the dataset, which supports the manifold hypothesis.

If propagation does not take place in free space, or if the power or height of the sources is variable, a longer code needs to be utilized to capture all information in the maps. Experiments with other datasets reveal that, in the presence of propagation phenomena such as shadowing and fading, radio maps lie close to a manifold of low dimension [18].

FIGURE S1. The decoder outputs for the average code $\lambda = \lambda_{\mathrm{avg}}$ (top panel) and its perturbed versions (panels for perturbed index sets $\mathcal{S} = \{1,2\}, \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}, \{3,4\}$; axes in m), obtained with an autoencoder with a code length of four [18].
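A minimal convolutional autoencoder with a code of length four, in the spirit of the sidebar experiment, might look as follows in PyTorch; the layer sizes are illustrative and not those used in [18].

```python
import torch
import torch.nn as nn

class MapAutoencoder(nn.Module):
    """Encoder: 32x32 map -> code of length 4. Decoder: code -> 32x32 map."""
    def __init__(self, code_len=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),   # 16x16
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),  # 8x8
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, code_len),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_len, 16 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (16, 8, 8)),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1),              # 32x32
        )

    def forward(self, x):
        code = self.encoder(x)       # lambda: condensed description of the map
        return self.decoder(code)

# Reconstruction loss for training on a batch of maps (placeholders here).
model = MapAutoencoder()
maps = torch.rand(16, 1, 32, 32)     # stand-in for actual power maps
loss = nn.functional.mse_loss(model(maps), maps)
loss.backward()
```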



Estimation of PSD maps

PSD maps describe how the power is distributed not only across space but also across the frequency domain. To estimate a PSD map $p(x, f)$, most schemes assume that the sensors measure the power that they receive at a set of frequencies $f_1, \ldots, f_{N_f}$. The $n$th measurement is, therefore, a vector $\mathbf{m}_n = [\tilde{p}(x_n, f_1), \ldots, \tilde{p}(x_n, f_{N_f})]^{\top}$, where $\tilde{p}(x, f)$ denotes the measured PSD at location $x$ and frequency $f$, possibly obtained by using a periodogram or Welch's method. Relying on these measurements, the goal is to obtain a PSD map estimate $\hat{p}$ such that $\hat{p}(x, f)$ is as close to the true $p(x, f)$ as possible. To this end, several alternatives are explored next.

Separate estimation per frequency

The simplest approach is to consider each frequency separately and essentially decompose the problem of estimating a PSD map at $N_f$ frequencies as $N_f$ problems of estimating a single power map [35]. More specifically, the $n$th power map is estimated from the PSD measurements $\tilde{p}(x_1, f_n), \ldots, \tilde{p}(x_N, f_n)$ using the techniques described earlier. The main limitation of this approach is that it disregards any structure in the frequency domain, making it more sensitive to measurement noise than other schemes explored later. On the upside, these approaches are simple and do not require prior knowledge on the channel or transmit PSD characteristics. Moreover, a twofold benefit arises in terms of the sizes of the training set and the number of parameters in schemes such as deep learning estimators. First, provided that the propagation environment affects all frequencies in a similar fashion, considering each frequency separately will increase the number of training examples by a factor of $N_f$. Second, if the neural network takes per-frequency measurements as the input rather than processing all frequencies jointly, the number of parameters to be learned can be significantly reduced [18, Sec. III-C1].

Estimation in narrowband channels

When the width of the band of interest is small or moderate, it makes sense to assume that the channel is not frequency selective [27], [36]. This means that the true PSD map can be written as $p(x, f) = \sum_s h_s(x)\, p^{\mathrm{TX}}_s(f)$, where $h_s(x) = h(x^{\mathrm{TX}}_s, x)$ is the channel gain from the $s$th transmitter to location $x$ and $p^{\mathrm{TX}}_s(f)$ is the transmit PSD of the $s$th source. This implies that the measurements essentially provide $N_f$ noisy linear combinations of the $S$ functions $h_1(x), \ldots, h_S(x)$. Therefore, when $N_f \gg S$, one can effectively exploit the structure in the frequency domain, improving robustness to measurement noise. One of the main benefits of this approach is that no knowledge of the transmit PSD is required, as it can often be estimated using tools, such as nonnegative matrix factorization, without requiring any prior knowledge [36].

Estimation in wideband channels

For a wideband channel, one cannot realistically assume that the channel response is flat. To exploit the frequency domain structure, one can utilize prior knowledge on the transmitter waveforms. Specifically, the PSDs of the transmitted waveforms are typically constrained by communication standards and spectrum regulations, which specify the bandwidth, carrier frequencies, transmission masks, roll-off factors, number of subcarriers, and so forth [37]. Therefore, the transmit PSD of a source can be approximated by a basis expansion model (BEM) as $p^{\mathrm{TX}}_s(f) = \sum_c \beta_{s,c}\, \phi_c(f)$, where $\phi_c$ denotes the $c$th basis function, and $\beta_{s,c}$ is a nonnegative quantity. This decomposition is illustrated in Figure 8.

FIGURE 7. A global DNN estimator, which provides $\hat{p}(x)$ for all values of $x$ on a grid. (Inputs depicted: a terrain map, a building map, a sampling mask, and the measurements, processed jointly by a CNN.)



If the signals transmitted by different sources are uncorrelated, the received PSD at a location $x$ can be expressed as $p(x, f) = \sum_s h(x^{\mathrm{TX}}_s, x, f)\, p^{\mathrm{TX}}_s(f)$, where $h(x^{\mathrm{TX}}_s, x, f)$ is the channel gain at frequency $f$. Then, using the BEM, one arrives at $p(x, f) = \sum_c \sum_s \beta_{s,c}\, h(x^{\mathrm{TX}}_s, x, f)\, \phi_c(f)$. If the bandwidths of the basis functions are small relative to the entire band, it is reasonable to assume that $h$ is approximately frequency flat in the band of each basis function. This yields $h(x^{\mathrm{TX}}_s, x, f)\, \phi_c(f) \approx h(x^{\mathrm{TX}}_s, x, \breve{f}_c)\, \phi_c(f)$, where $\breve{f}_c$ is the central frequency of the $c$th basis function. With this approximation, one can write $p(x, f) = \sum_c p_c(x)\, \phi_c(f)$, where $p_c(x) = \sum_s h(x^{\mathrm{TX}}_s, x, \breve{f}_c)\, \beta_{s,c}$ constitutes the power captured by the $c$th basis function at location $x$.

Observe that introducing the BEM has reduced the problem of estimating $N_f$ power maps to the problem of estimating the $C \ll N_f$ power maps $p_1, \ldots, p_C$. Clearly, the smaller $C$, the smaller the sensitivity to measurement noise. The approaches in the preceding sections can be seen as the extreme cases of choosing $C = N_f$ and $C = 1$, respectively. To estimate a PSD map, the aforementioned technique can be used in combination with virtually any of the approaches for power map estimation discussed earlier [2], [4], [8], [9]. A recent example is [18], where a BEM is used in the last layer of a DNN for RME.

The main limitation of estimators that rely on a BEM is a manifestation of the well-known bias–variance tradeoff. In particular, if the number of basis functions is small, the approximation $h(x^{\mathrm{TX}}_s, x, f)\, \phi_c(f) \approx h(x^{\mathrm{TX}}_s, x, \breve{f}_c)\, \phi_c(f)$ may not hold, which will generally result in estimation bias. On the other hand, if the number of basis functions is large, the representation capacity of the BEM is large, which results in a small bias, but a larger variance must be expected as the result of the increase in the number of scalar maps to be learned for a fixed number of samples.

FIGURE 8. A basis expansion model can be used to decompose a PSD as a linear combination of functions in a basis. In this case, the basis functions are raised cosine functions, each one corresponding to a transmission in a different band. This makes it possible to exploit prior information about bandwidths, central frequencies, and transmission pulse shapes.
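As a concrete illustration, the sketch below builds raised-cosine basis functions, forms a transmit PSD $\sum_c \beta_{s,c}\phi_c(f)$, and recovers the per-basis powers $p_c(x_n)$ at one sensor from its $N_f$ PSD samples by nonnegative LS. The band plan, roll-off, and coefficient values are invented for the example.

```python
import numpy as np
from scipy.optimize import nnls

def raised_cosine(f, fc, bw, rolloff=0.25):
    """Unit-height raised-cosine mask centered at fc with bandwidth bw."""
    x = np.abs(f - fc) / (bw / 2.0)
    out = np.where(x <= 1 - rolloff, 1.0,
                   0.5 * (1 + np.cos(np.pi * (x - (1 - rolloff)) / (2 * rolloff))))
    return np.where(x <= 1 + rolloff, out, 0.0)

f = np.linspace(0, 35, 200)                    # N_f frequency samples (MHz)
centers, bw = [5.0, 13.0, 21.0, 29.0], 6.0     # assumed band plan (C = 4)
Phi = np.stack([raised_cosine(f, fc, bw) for fc in centers], axis=1)  # N_f x C

# Measured PSD at one sensor: channel scales each band's power, plus noise.
p_c_true = np.array([0.20, 0.05, 0.15, 0.10])  # p_c(x_n) for c = 1..C
m_n = Phi @ p_c_true + 0.005 * np.random.default_rng(6).standard_normal(len(f))

# Recover the C per-basis powers from N_f >> C samples (nonnegativity prior).
p_c_hat, _ = nnls(Phi, m_n)
```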
time-varying shadow fading map a SF n (x, t).
Estimation of propagation maps

Propagation maps quantify channel effects, such as channel gains, for links between arbitrary pairs of locations where no sensors may have been deployed. The $n$th measurement is collected by a pair of sensors, one at location $x_n$ and the other at $x'_n$. The channel gain of the link between them can possibly be measured by employing pilot signals. The resulting measurement can be expressed as $m_n = h(x_n, x'_n) + z_n$, where $h$ is the true map, and $z_n$ represents the measurement noise. The RME problem is to obtain an estimate $\hat{h}$ of $h$ given $\{(x_n, x'_n, m_n)\}_{n=1}^{N}$. A good RME algorithm should have good generalization properties, meaning that $\hat{h}(x, x') \approx h(x, x')$ for all location pairs $(x, x')$, even those for which no measurements have been collected.

Like signal strength RME, propagation RME is a function estimation problem. Therefore, the techniques described in the "Estimation of Signal Strength Maps" section can again be employed in principle. The key difference is that now the function to be estimated depends on two locations rather than one. If $x$ denotes a location in 3D space, it is clear that $h$ is a function of a 6D input, namely, the entries of $x$ and $x'$. This means that the number of measurements necessary to attain a given accuracy may be considerably greater than for estimating a signal strength map, a manifestation of the curse of dimensionality. Thus, as explored next, a number of algorithms have been tailor-made for propagation RME to alleviate such a difficulty.

Nontomographic approaches

In the nontomographic approaches, channel gains are directly modeled based on basic wireless propagation models without introducing any underlying auxiliary map. To maintain tractability, however, the RME problem is often simplified by fixing one end of a link. For example, one may consider estimating the maps $\{h_n(x) := h(x, x_n)\}_n$ for fixed positions $\{x_n\}_{n=1}^{N}$ where the sensors are located. The individual functions $\{h_n(x)\}_n$ can be estimated using methods employed for signal strength maps. Since (static) signal strength map estimation techniques have been explained in the preceding sections, here, we extend the RME problem to include the time domain to capture the temporal variation of channel gains. Needless to say, static channel gain maps can also be constructed in a nontomographic fashion.

Consider the channel gain $h_n(x, t)$ between locations $x$ and $x_n$ at time $t$ [5]. Suppose that the effect of small-scale fading has been averaged out, allowing $h_n(x, t)$ to be expressed in decibels as $h_{n,\mathrm{dB}}(x, t) = h^{\mathrm{PL}}_n(x) - a^{\mathrm{SF}}_n(x, t)$, where $h^{\mathrm{PL}}_n(x)$ is the known path loss from $x_n$ to $x$, and $a^{\mathrm{SF}}_n(x, t)$ is the shadow fading between $x$ and $x_n$ at time $t$. Note that $h^{\mathrm{PL}}_n(x)$ can be assumed known whenever both $x_n$ and $x$ as well as the antenna gains are known. Thus, the problem becomes tracking the time-varying shadow fading map $a^{\mathrm{SF}}_n(x, t)$.



To do this, shadow fading measurements are needed, which can be obtained by subtracting the transmit power and the path loss from the received power measurement. By letting $\mathcal{N}_{-n} := \{1, \ldots, n-1, n+1, \ldots, N\}$, the noisy measurements $\{\breve{a}^{\mathrm{SF}}_n(x_j, t)\}_j$ of shadow fading obtained at time $t$ by the sensor at $x_n$ using the pilot signals sent from the radios at $\{x_j\}_{j \in \mathcal{N}_{-n}}$ can be expressed as

$\breve{a}^{\mathrm{SF}}_n(x_j, t) = a^{\mathrm{SF}}_n(x_j, t) + z_n(x_j, t), \quad j \in \mathcal{N}_{-n}$   (14)

where $z_n(x_j, t)$ is the zero-mean Gaussian measurement noise. Upon defining $\breve{\mathbf{a}}^{\mathrm{SF}}_n(t) := [\breve{a}^{\mathrm{SF}}_n(x_1, t), \ldots, \breve{a}^{\mathrm{SF}}_n(x_{n-1}, t), \breve{a}^{\mathrm{SF}}_n(x_{n+1}, t), \ldots, \breve{a}^{\mathrm{SF}}_n(x_N, t)]^{\top}$, the problem is to estimate $h_{n,\mathrm{dB}}(x, t)$ for arbitrary $x$ based on the measurements $\breve{\mathcal{A}}^{\mathrm{SF}}_n(t) := \{\breve{\mathbf{a}}^{\mathrm{SF}}_n(\tau)\}_{\tau=1}^{t}$ up to time $t$.

This problem can be tackled in the framework of kriged Kalman filtering, also known as space–time Kalman filtering [38]. Employing the log-normal shadowing model, it is assumed that $a^{\mathrm{SF}}_n(x, t)$ is a Gaussian process with spatiotemporal dynamics [5], [38]:

$a^{\mathrm{SF}}_n(x, t) = \nu^{\mathrm{SF}}_n(x, t) + o_n(x, t)$   (15)

$\nu^{\mathrm{SF}}_n(x, t) = \int w_n(x, u)\, \nu^{\mathrm{SF}}_n(u, t-1)\, du + \eta_n(x, t)$   (16)

where $\nu^{\mathrm{SF}}_n(x, t)$ is the spatiotemporally correlated component, $w_n(x, u)$ captures the interaction of this component at location $x$ at time $t$ and at location $u$ at time $(t-1)$, and $o_n(x, t)$ and $\eta_n(x, t)$ are spatially correlated but temporally white zero-mean Gaussian processes. Process $o_n(x, t)$ is uncorrelated with $z_n(u, \tau)$, and $\eta_n(x, t)$ is uncorrelated with $o_n(u, \tau)$ and $z_n(u, \tau)$ for all $u$ and $\tau$. Moreover, $E\{o_n(x, t)\, \nu^{\mathrm{SF}}_n(u, t)\} = E\{\eta_n(x, t)\, \nu^{\mathrm{SF}}_n(u, t-1)\} = 0$ for all $x$, $u$, and $t$.

Since the state–space model in (15) and (16) is infinite dimensional, adopt a BEM for tractability, as in universal kriging. For a set of $K$ orthonormal basis functions $\{\psi_k(x)\}_k$, $\nu^{\mathrm{SF}}_n$ and $w_n$ are, respectively, approximated as $\nu^{\mathrm{SF}}_n(x, t) \approx \sum_{k=1}^{K} a_{n,k}(t)\, \psi_k(x)$ and $w_n(x, u) \approx \sum_{k=1}^{K} b_{n,k}(x)\, \psi_k(u)$ with expansion coefficients $\{a_{n,k}(t)\}$ and coefficient functions $\{b_{n,k}(x)\}$. Substituting these expansions into (14)–(16) and evaluating the resulting equations at $\{x_j\}_{j \in \mathcal{N}_{-n}}$ yields the finite-dimensional state–space model:

$\breve{\mathbf{a}}^{\mathrm{SF}}_n(t) = \boldsymbol{\Psi}_n \mathbf{a}_n(t) + \mathbf{o}_n(t) + \mathbf{z}_n(t)$   (17)

$\mathbf{a}_n(t) = \boldsymbol{\Psi}^{\dagger}_n \mathbf{B}_n \mathbf{a}_n(t-1) + \boldsymbol{\Psi}^{\dagger}_n \boldsymbol{\eta}_n(t).$   (18)

Here, $\boldsymbol{\psi}(x) := [\psi_1(x), \ldots, \psi_K(x)]^{\top}$, $\mathbf{a}_n(t) := [a_{n,1}(t), \ldots, a_{n,K}(t)]^{\top}$, and $\mathbf{b}_n(x)$ is defined likewise. Vectors $\mathbf{o}_n(t)$, $\mathbf{z}_n(t)$, and $\boldsymbol{\eta}_n(t)$ are constructed in a similar fashion from $\{o_n(x_j, t)\}_{j \in \mathcal{N}_{-n}}$, $\{z_n(x_j, t)\}_{j \in \mathcal{N}_{-n}}$, and $\{\eta_n(x_j, t)\}_{j \in \mathcal{N}_{-n}}$, respectively. $\mathbf{B}_n$ and $\boldsymbol{\Psi}_n$ are matrices constructed by, respectively, arranging $\mathbf{b}_n(x_j)^{\top}$ and $\boldsymbol{\psi}(x_j)^{\top}$ as rows for $j \in \mathcal{N}_{-n}$.

Based on (17) and (18), the LMMSE (or simply MMSE in this case due to the joint Gaussianity assumption) estimate $\hat{\mathbf{a}}_n(t\,|\,t)$ of $\mathbf{a}_n(t)$ given $\breve{\mathcal{A}}^{\mathrm{SF}}_n(t)$ can be obtained via ordinary Kalman filtering, from which the temporally dynamic component $\nu^{\mathrm{SF}}_n(x, t)$ can be estimated as $E\{\nu^{\mathrm{SF}}_n(x, t)\,|\,\breve{\mathcal{A}}^{\mathrm{SF}}_n(t)\} = \boldsymbol{\psi}^{\top}(x)\, \hat{\mathbf{a}}_n(t\,|\,t)$. To capture $o_n(x, t)$ as well, a kriging estimator is employed (compare with the "Kriging" section). Overall, the LMMSE estimate $\hat{a}^{\mathrm{SF}}_n(x, t) := E\{a^{\mathrm{SF}}_n(x, t)\,|\,\breve{\mathcal{A}}^{\mathrm{SF}}_n(t)\}$ can be obtained exploiting the covariance structure [5]. Once $\hat{a}^{\mathrm{SF}}_n(x, t)$ is obtained, the channel gain map estimate $\hat{h}_{n,\mathrm{dB}}(x, t)$ can be constructed as $\hat{h}_{n,\mathrm{dB}}(x, t) = h^{\mathrm{PL}}_n(x) - \hat{a}^{\mathrm{SF}}_n(x, t)$.
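For intuition, the following sketch runs the standard Kalman recursion on a state–space model of the form (17)–(18), treating the state transition matrix (the role of $\Psi_n^{\dagger}\mathbf{B}_n$) and all noise covariances as known, and lumping the spatially correlated term $\mathbf{o}_n(t)$ into the measurement noise for simplicity. Every matrix here is randomly generated for illustration.

```python
import numpy as np

def kalman_step(a_prev, P_prev, y, F, H, Qn, Rn):
    """One Kalman iteration: state a(t) = F a(t-1) + process noise (cov Qn);
    measurement y(t) = H a(t) + measurement noise (cov Rn)."""
    a_pred = F @ a_prev                        # predict
    P_pred = F @ P_prev @ F.T + Qn
    S = H @ P_pred @ H.T + Rn                  # innovation covariance
    G = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    a_filt = a_pred + G @ (y - H @ a_pred)     # correct
    P_filt = (np.eye(len(a_prev)) - G @ H) @ P_pred
    return a_filt, P_filt

rng = np.random.default_rng(7)
K, J = 5, 9                                    # basis size, |N_{-n}| measurements
Psi_n = rng.standard_normal((J, K))            # basis evaluated at the x_j
F = 0.9 * np.eye(K)                            # stand-in for Psi_n^dagger B_n
Qn, Rn = 0.1 * np.eye(K), 0.5 * np.eye(J)      # process / measurement noise covs

a, P = np.zeros(K), np.eye(K)
for t in range(50):                            # track a_n(t) from streaming data
    y = Psi_n @ (0.5 * np.ones(K)) + rng.multivariate_normal(np.zeros(J), Rn)
    a, P = kalman_step(a, P, y, F, Psi_n, Qn, Rn)
nu_hat_at_x = lambda psi_x: psi_x @ a          # psi(x)^T a_n(t|t), as in the text
```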
Tomographic approaches

The radio tomographic model can be used to estimate shadow fading maps. It postulates that the attenuation due to shadowing can be expressed in terms of an underlying auxiliary map termed the spatial loss field (SLF) [16], [39]. The SLF characterizes how much radio waves attenuate when passing through each location and, hence, is specific to each propagation environment [40]. Specifically, the radio tomographic model prescribes that the shadowing attenuation between locations $x$ and $x'$ is given by the line integral [40]

$a^{\mathrm{SF}}(x, x') = \frac{1}{\sqrt{\|x - x'\|}} \int_{x}^{x'} F(\bar{x})\, d\bar{x}$   (19)

where $F : \mathcal{X} \rightarrow \mathbb{R}_+$ is the SLF. This naturally captures the notion that nearby radio links generally experience similar shadowing due to the presence of common obstacles. Since this integral provides the shadowing attenuation between two arbitrary locations, one does not need to fix one end of a link, as in the nontomographic approach explained earlier. Remarkably, time-varying maps can be readily accommodated in the tomographic approach [16]. Furthermore, as the SLF can reveal the locations of obstacles, the SLF itself can be useful for various applications, such as device-free passive localization [6], surveillance monitoring for intrusion detection [41], and through-the-wall imaging for emergency or military operations [6].

FIGURE 9. The possible approximations of the tomographic integral.

To approximate the line integral in (19), $F$ can be discretized on a regular grid of 2D or 3D spatial locations. One common approach is to approximate this integral as a weighted sum of the SLF values on the grid points that lie inside an ellipse or an ellipsoid with foci at $x$ and $x'$, as shown in Figure 9. The intuition is that the attenuation between the two endpoints should



be heavily affected by the obstacles around the line of sight or, more specifically, within the so-called Fresnel zone, which is an ellipse whose geometry is dictated by the wavelength. Several functions have been proposed in the literature [41] to generate such weights, mainly based on heuristics. Alternatively, the weights can be learned from the data through blind schemes [6].

While easy to implement, this approximation yields shadowing maps with discontinuities, as small changes in $x$ and $x'$ may lead to a change in the set of grid points that lie in the ellipse. Even more, if the ellipse misses all of the grid points, as shown by the left ellipse in Figure 9, the approximation becomes zero. Thus, to attain good accuracy, the grid must be dense enough.

This motivates an alternative approach, where the SLF is approximated as a piecewise constant function, taking a constant value within each grid cell (or voxel) [17]. The integral can then be computed as the weighted sum of the SLF values in the cells that the line of sight traverses. The weight simply corresponds to the distance traversed in each cell. This is illustrated by the colored line in Figure 9. This approximation involves less computational burden than the one based on the ellipse, is continuous in $x$ and $x'$, and does not vanish unless the SLF vanishes. Thus, the need for a dense grid is relaxed, which is particularly attractive in 3D [17].

In either approach, the shadowing attenuation is a linear function of the SLF values at the grid points. Thus, the SLF can be estimated via (nonnegative) LS. However, this requires that the number of measurements be significantly larger than the number of grid points. One can mitigate this through appropriate regularizers [43] or by specifying a prior distribution in a Bayesian framework [44]. Another limitation of tomographic approaches is that only the attenuation due to absorption (shadowing) is accounted for. Other propagation effects, such as reflection, refraction, and diffraction, are completely ignored.
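The following sketch implements the piecewise constant approximation: each link's attenuation is a weighted sum of SLF voxel values, with weights equal to the length traversed in each cell (computed here by dense sampling along the segment, a simple stand-in for an exact ray-voxel intersection), after which the SLF is recovered by nonnegative LS. The grid size, link count, and obstacle placement are illustrative, and the $1/\sqrt{\text{distance}}$ normalization in (19) is omitted for brevity.

```python
import numpy as np
from scipy.optimize import nnls

GRID = 20                                    # SLF discretized on a GRID x GRID map
CELL = 1.0                                   # cell side length (m)

def link_weights(x, xp, n_samples=400):
    """Approximate per-voxel traversal lengths of the segment x -> xp by
    sampling points along it and binning them into cells."""
    pts = x[None, :] + np.linspace(0, 1, n_samples)[:, None] * (xp - x)[None, :]
    idx = np.clip((pts / CELL).astype(int), 0, GRID - 1)
    w = np.zeros(GRID * GRID)
    seg = np.linalg.norm(xp - x) / n_samples   # length represented by each sample
    np.add.at(w, idx[:, 0] * GRID + idx[:, 1], seg)
    return w

rng = np.random.default_rng(8)
F_true = np.zeros(GRID * GRID)
F_true[8 * GRID + 8] = 2.0                   # a single absorbing obstacle

# Shadowing measurements over random sensor-pair links: a = W F + noise.
links = rng.uniform(0, GRID * CELL, size=(600, 2, 2))
W = np.stack([link_weights(l[0], l[1]) for l in links])
a = W @ F_true + 0.01 * rng.standard_normal(len(links))

F_hat, _ = nnls(W, a)                        # nonnegative LS estimate of the SLF
slf_map = F_hat.reshape(GRID, GRID)
```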
Spectrum surveying

To collect the measurements needed to build radio maps, traditionally, technicians in a vehicle with measurement equipment would drive around the site. With the advances in mobile robotics of the last decade, it is now possible to employ an autonomous UAV with an onboard sensor to collect the desired measurements. This is clearly more efficient in terms of time and personnel cost.

An important task is to plan the path traversed by the autonomous UAV for acquiring measurements. A common approach is to define a grid and take measurements at each grid point. However, visiting each grid point can be very time-consuming and puts a strain on the limited battery capacity, especially when the grid is dense. A more efficient approach is to collect measurements at a small set of highly informative locations and apply the interpolation techniques discussed in previous sections to construct the entire map. To this end, in addition to the map estimate, RME algorithms need to provide an uncertainty map that indicates how informative a measurement would be at each location given the measurements collected so far [22]. Based on the uncertainty map, a route planning algorithm can produce a trajectory through areas of high uncertainty. This approach achieves a much higher estimation quality in a given surveying time (or requires a much shorter time for a given quality) compared to the naive grid-based approach.

FIGURE 10. An example of a surveying operation with an autonomous UAV in an urban environment seen from above: the (a) true power, (b) estimated power, and (c) uncertainty metric. White boxes denote buildings. Red and white crosses denote measurement locations.

Figure 10 illustrates an example of a surveying operation using a ray-tracing dataset in a region of downtown Rosslyn, VA. The three panels show the UAV trajectory seen from above. White boxes correspond to space occupied by buildings, where no measurements can be taken. Red and white crosses denote measurement locations. Figure 10(a) shows the ground-truth power map in a setup with two transmitters. Figure 10(b) and (c) respectively show the estimated power map and the uncertainty map when only the measurements marked by red crosses have been collected. At that point in time, the UAV
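In its simplest form, uncertainty-driven planning repeatedly sends the UAV toward the reachable grid point with the highest uncertainty. The sketch below implements this greedy rule as a simplified stand-in for the route planners discussed in [22]; the uncertainty map and feasibility mask are placeholders.

```python
import numpy as np

def next_waypoint(uncertainty, feasible, pos, max_step=10.0):
    """Greedy rule: among feasible grid points within max_step of the current
    UAV position, pick the one with the largest uncertainty."""
    I, J = uncertainty.shape
    ii, jj = np.meshgrid(np.arange(I), np.arange(J), indexing="ij")
    dist = np.hypot(ii - pos[0], jj - pos[1])
    score = np.where(feasible & (dist <= max_step), uncertainty, -np.inf)
    return np.unravel_index(np.argmax(score), score.shape)

rng = np.random.default_rng(9)
uncertainty = rng.uniform(size=(40, 40))       # placeholder uncertainty map
feasible = rng.uniform(size=(40, 40)) > 0.1    # False where buildings stand
pos = (0, 0)
for _ in range(5):                             # a short surveying leg
    pos = next_waypoint(uncertainty, feasible, pos)
    uncertainty[pos] = 0.0                     # measuring resolves uncertainty
```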




plans a trajectory through areas of high uncertainty, represented by white crosses. The estimator, in this case, is a global DNN estimator capable of learning the nature of propagation phenomena from a dataset (see the "Deep Learning" section).

Practical considerations

In this section, some challenges that arise when implementing RME techniques in practical setups are discussed. These include coping with localization errors, nonisotropic antenna patterns, decentralized implementation, and reducing the bandwidth required to collect measurements.

Localization errors

The RME schemes described earlier typically require accurate knowledge of the measurement locations. In practice, the sensor locations are themselves estimated based on localization systems, such as GPS, in the following way. A number of transmitters with known locations, such as satellites or cellular base stations, regularly transmit signals termed localization pilots. Each sensor then extracts certain features from the pilots to estimate its location. For example, the received signal strength or the propagation delay, which contain information on the distance to the transmitters, are used to produce location estimates based on geometric principles. Thus, the quality of the estimates can be significantly degraded due to multipath propagation, as in indoor and dense urban scenarios, where localization errors may reach tens of meters.

This phenomenon is illustrated in Figure 11, where the x-coordinates of the location estimates are compared in a scenario without multipath [Figure 11(a)] and with multipath [Figure 11(b)]. The localization algorithm is based on the time difference of arrival between the pilot signals arriving from each pair of transmitters; see [45] for details. The poor quality of the location estimates in Figure 11(b) hinders the application of conventional RME techniques. This is because the maps are indexed by the locations (e.g., the input for the power map $p$ is $x$) and, thus, the localization error in $x$ propagates to the output $p(x)$.
FIGURE 11. (a) Free space. (b) The scenario with four walls. The color of each point indicates the x-coordinate of the location estimate obtained by a sensor at that location. The black circles indicate the positions of the transmitters. In (a), the estimate accurately matches the true coordinate when there is no multipath; thus, this image serves as a color bar. On the other hand, in (b), the estimation error is large in the presence of multipath. (Source: [45]; used with permission.)

The key realization is, therefore, that $x$ is not suitable as the "index" of the map. To mitigate this issue, one can resort to the so-called location-free cartography framework [45]. To motivate this framework, it is worth stepping back and recalling that the location estimates are produced by a localization algorithm based on the pilot features. It is sensible, therefore, to bypass this step and directly use the pilot features to index the map since these features evolve more smoothly across space than the location estimates.

Once such a map has been estimated, there are two approaches to evaluate it at a given location. If a terminal is present at that location, it can directly employ the features of the pilot signals. If no sensor is present, one can interpolate the features, e.g., based on the low-rank prior.

Due to the larger input dimension of the map function, location-free RME requires a larger number of measurements than location-based approaches in the absence of localization errors. Another difficulty is that the availability of the features depends on the availability of the pilot signals. However, one can reconstruct the missing features [45] or define features that can be extracted from regular communication signals, e.g., the ones broadcast by cellular base stations, rather than from dedicated localization pilots [46].

Antenna patterns

So far, we assumed that a signal strength map $p(x)$ provides the power received by a sensor with an isotropic antenna at location $x$. If the antenna pattern is not isotropic, the measured power will depend on the sensor orientation. For this reason, it may be convenient to estimate the angular spectrum map $p(x, \theta)$, which provides the angular power density received by a sensor at location $x$ from direction $\theta$. Here, $\theta$ parameterizes the direction through, e.g., the azimuth and elevation angles.

If $\Gamma(\theta - \theta')$ denotes the antenna gain along direction $\theta$ for a sensor with orientation $\theta'$, it follows that the power received by such a sensor when placed at $x$ will be $\int \Gamma(\theta - \theta')\, p(x, \theta)\, d\theta$. If the sensor orientations associated with all measurements are known, then each measurement is a noisy linear observation of $p(x, \theta)$ and, therefore, the latter can be estimated. The techniques described earlier for PSD map estimation in the frequency domain can be adapted to this end, possibly upon discretizing the aforementioned integral. Specifically, $p(x, \theta)$ can be estimated for a discrete set of angle bins separately or



by parameterizing $p(x, \theta)$ by means of a BEM with standard or tailored basis functions along the lines of [4, Sec. III-A], although the choice of suitable basis functions seems to warrant further research.

The challenges emerging in this approach are twofold. First, due to the curse of dimensionality and the fact that the function $p(x, \theta)$ takes the additional input $\theta$, a significantly larger number of measurements may be required to estimate $p(x, \theta)$ relative to $p(x)$. Second, sensors need to be able to measure their orientation, e.g., through accelerometers and magnetometers, which affects the cost and introduces additional error sources.

A pragmatic alternative is to treat the sensor orientations as random variables with a uniform distribution over orientations $\theta'$. This implies that the isotropic power map $p(x)$ equals the expectation of $p(x, \theta)$ and, thus, one can still estimate $p(x)$ using the procedures described in previous sections upon disregarding orientation. The uncertainty introduced by the directionality of the antennas translates into additional measurement noise, which, therefore, increases the number of measurements required to estimate $p(x)$ with a target accuracy. This is the price to be paid for circumventing the aforementioned limitations.

Decentralized implementation

Unlike conventional spectrum sensing techniques, which often assume a common spectrum occupancy over the entire area of interest [15], spectrum cartography accounts for spatial variability. Thus, it is necessary that the measurements be obtained at various locations $\{x_n\}_{n=1}^{N}$ within the region, which then must be processed jointly. While this can be achieved, in theory, by collecting the measurements at a fusion center (FC) for centralized processing, the feedback overhead and the associated delay can be significant in practice. Moreover, the FC must operate with higher resource and security requirements. An alternative is to employ distributed in-network processing, where all sensors collaboratively estimate the map via local interactions; i.e., the $n$th sensor, $n \in \mathcal{N} := \{1, \ldots, N\}$, exchanges information only with its set of single-hop neighbors $\mathcal{N}_n \subset \mathcal{N}$ [2], [5], [9], [16], [47]. The key idea is that RME tasks often boil down to a regression problem of the form

$\min_{\boldsymbol{\theta}} \ \frac{1}{2}\, \| \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \|_2^2 + \rho(\boldsymbol{\theta})$   (20)

where $\mathbf{y} \in \mathbb{R}^{M}$ and $\mathbf{X} \in \mathbb{R}^{M \times H}$ represent the targets and the regressors, respectively; $\boldsymbol{\theta} \in \mathbb{R}^{H}$ contains the regression coefficients; and $\rho(\cdot)$ is a convex regularizer that captures prior information; see, e.g., (8). It is often the case that the data $\mathbf{X}$ and $\mathbf{y}$ consist of the collection of the data $\{\mathbf{X}_n\}$ and $\{\mathbf{y}_n\}$ from the individual sensors. That is, $\mathbf{y} = [\mathbf{y}_1^{\top}, \ldots, \mathbf{y}_N^{\top}]^{\top}$, where $\mathbf{y}_n \in \mathbb{R}^{M_n}$ for $n \in \mathcal{N}$ and $\sum_{n=1}^{N} M_n = M$. Likewise, $\mathbf{X} = [\mathbf{X}_1^{\top}, \ldots, \mathbf{X}_N^{\top}]^{\top}$ with $\mathbf{X}_n \in \mathbb{R}^{M_n \times H}$ for $n \in \mathcal{N}$.

To solve (20) in a decentralized manner, consider first an undirected graph $\mathcal{G} := (\mathcal{N}, \mathcal{E})$ with vertex set $\mathcal{N}$ and edge set $\mathcal{E}$, where vertices represent sensors, and the edge $(n, n')$ is in $\mathcal{E}$ whenever sensors $n$ and $n'$ can communicate in a single hop, i.e., $n' \in \mathcal{N}_n$. If $\mathcal{G}$ is connected, i.e., there is a (possibly multihop) path between every pair of sensors, it can be easily shown that (20) is equivalent to

$\min_{\{\boldsymbol{\theta}_n, \mathbf{c}_n, \mathbf{c}_{(n,n')}\}} \ \sum_{n=1}^{N} \left[ \frac{1}{2}\, \| \mathbf{y}_n - \mathbf{X}_n \mathbf{c}_n \|_2^2 + \frac{1}{N}\, \rho(\boldsymbol{\theta}_n) \right]$   (21a)

subject to $\mathbf{c}_n = \boldsymbol{\theta}_n, \quad n \in \mathcal{N}$   (21b)

$\boldsymbol{\theta}_n = \mathbf{c}_{(n,n')} = \boldsymbol{\theta}_{n'}, \quad n' \in \mathcal{N}_n, \ n \in \mathcal{N}$   (21c)

where $\{\mathbf{c}_n\}$ and $\{\mathbf{c}_{(n,n')}\}$ are auxiliary variables. Per (21b), $\mathbf{c}_n$ is just a copy of $\boldsymbol{\theta}_n$. In addition, $\{\mathbf{c}_{(n,n')}\}$ facilitate the derivation of simple update rules and are eventually eliminated. A decentralized algorithm can be derived by applying the




alternating-direction method of multipliers to (21a)–(21c). Following steps similar to those in [9, Appendix D], one can obtain the decentralized update rules for iteration $k$ as

$\mathbf{u}_n^{[k]} = \mathbf{u}_n^{[k-1]} + \tau \sum_{n' \in \mathcal{N}_n} \left( \boldsymbol{\theta}_n^{[k]} - \boldsymbol{\theta}_{n'}^{[k]} \right)$   (22a)

$\mathbf{m}_n^{[k]} = \mathbf{m}_n^{[k-1]} + \tau \left( \boldsymbol{\theta}_n^{[k]} - \mathbf{c}_n^{[k]} \right)$   (22b)

$\boldsymbol{\theta}_n^{[k+1]} = \arg\min_{\boldsymbol{\theta}} \ \frac{1}{N}\, \rho(\boldsymbol{\theta}) + \frac{\bar{c}_n}{2}\, \| \boldsymbol{\theta} - \mathbf{a}_n \|_2^2$   (22c)

$\mathbf{c}_n^{[k+1]} = \left( \tau \mathbf{I}_H + \mathbf{X}_n^{\top} \mathbf{X}_n \right)^{-1} \left( \mathbf{X}_n^{\top} \mathbf{y}_n + \tau \boldsymbol{\theta}_n^{[k+1]} + \mathbf{m}_n^{[k]} \right)$   (22d)

where $\tau > 0$ is the step size, $\bar{c}_n := \tau(1 + 2|\mathcal{N}_n|)$, and

$\mathbf{a}_n := \frac{1}{\bar{c}_n} \left( \tau \sum_{n' \in \mathcal{N}_n} \left( \boldsymbol{\theta}_n^{[k]} + \boldsymbol{\theta}_{n'}^{[k]} \right) + \tau \mathbf{c}_n^{[k]} - \mathbf{u}_n^{[k]} - \mathbf{m}_n^{[k]} \right)$   (22e)

for $n \in \mathcal{N}$. As can be seen in (22a) and (22e), the updates involve only local communication with the neighbors. The proximal problem in (22c) admits a closed-form solution for various common choices of $\rho(\cdot)$. It can be proven that the iterate $\boldsymbol{\theta}_n^{[k]}$ for any $n \in \mathcal{N}$ converges to the solution of (20) as $k \rightarrow \infty$ [9].
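A compact simulation of (22a)–(22e) for ridge regression, i.e., $\rho(\boldsymbol{\theta}) = \lambda\|\boldsymbol{\theta}\|_2^2$, whose proximal step in (22c) is closed form, is sketched below. The ring topology, data sizes, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
N, H, Mn = 6, 4, 20                        # sensors, coefficients, data per sensor
theta_true = rng.standard_normal(H)
Xs = [rng.standard_normal((Mn, H)) for _ in range(N)]
ys = [X @ theta_true + 0.1 * rng.standard_normal(Mn) for X in Xs]
nbrs = [[(n - 1) % N, (n + 1) % N] for n in range(N)]   # ring topology
tau, lam = 1.0, 0.1

theta = np.zeros((N, H)); c = np.zeros((N, H))
u = np.zeros((N, H)); m = np.zeros((N, H))
for k in range(200):
    # (22a), (22b): multiplier updates using only neighbors' current iterates.
    for n in range(N):
        u[n] += tau * sum(theta[n] - theta[j] for j in nbrs[n])
        m[n] += tau * (theta[n] - c[n])
    # (22e) and (22c): proximal step; for rho = lam * ||.||^2 it is a scaling.
    theta_new = np.empty_like(theta)
    for n in range(N):
        cbar = tau * (1 + 2 * len(nbrs[n]))
        a = (tau * sum(theta[n] + theta[j] for j in nbrs[n])
             + tau * c[n] - u[n] - m[n]) / cbar
        theta_new[n] = cbar * a / (2 * lam / N + cbar)
    theta = theta_new
    # (22d): local LS-type update.
    for n in range(N):
        c[n] = np.linalg.solve(tau * np.eye(H) + Xs[n].T @ Xs[n],
                               Xs[n].T @ ys[n] + tau * theta[n] + m[n])
```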
Rate constraints

To maintain up-to-date maps, every certain time interval, the sensors need to collect new measurements, which are then sent to an FC or shared with other nodes. For maps that change rapidly over time or require high-dimensional measurements, such as PSD maps, the bandwidth required to report the measurements may be significant. To mitigate such an issue, compression and quantization can be employed [4].

The idea is twofold. First, instead of directly computing PSD estimates at the sensors, each sensor measures the powers at the outputs of a filter bank acting on the received signal. Then, only quantized versions of those measurements are reported; see Figure 12.

FIGURE 12. To reduce the rate necessary to report measurements, sensors may use a bank of random filters. The energy of each filter is measured, quantized, and sent to an FC that performs RME.

To simplify the exposition, assume, for now, that each sensor employs a single filter. Recall that $p(x, f)$ denotes the PSD at location $x$. If the received signal at location $x_n$ is processed by a filter with frequency response $\Gamma_n(f)$, the output power is given by $\bar{p}_n := \int p(x_n, f)\, |\Gamma_n(f)|^2\, df$. Due to the measurement noise, the measured value $\tilde{p}_n$ will be generally different from the true $\bar{p}_n$. Subsequently, $\tilde{p}_n$ is quantized to $\breve{m}_n$, which is then sent to the FC or other sensors. This clearly requires a much smaller bandwidth than sending, e.g., the entire periodogram.

To see how the map can be estimated from these linearly compressed and quantized measurements, recall the decomposition $p(x, f) = \sum_c p_c(x)\, \phi_c(f)$ from the "Estimation in Wideband Channels" section. Since the basis functions $\phi_c$ are known, this decomposition reduces the problem of estimating $p$ to that of estimating the $C$ functions $p_1, \ldots, p_C$. It also follows that $\bar{p}_n$ can be written as $\bar{p}_n = \sum_c p_c(x_n) \int \phi_c(f)\, |\Gamma_n(f)|^2\, df = [p_1(x_n), \ldots, p_C(x_n)]\, \boldsymbol{\zeta}_n$, where the $c$th entry of the vector $\boldsymbol{\zeta}_n$ is $\int \phi_c(f)\, |\Gamma_n(f)|^2\, df$. In other words, $\bar{p}_n$ is a linear combination of the values that the functions $p_1, \ldots, p_C$ take at $x_n$.

If the true powers $\bar{p}_1, \ldots, \bar{p}_N$ were known exactly, one could seek RKHS functions $\hat{p}_1, \ldots, \hat{p}_C$ such that $[\hat{p}_1(x_n), \ldots, \hat{p}_C(x_n)]\, \boldsymbol{\zeta}_n = \bar{p}_n$ for all $n$ using kernel-based learning. Now, consider the case where, instead of $\bar{p}_1, \ldots, \bar{p}_N$, one has the quantized measurements $\breve{m}_1, \ldots, \breve{m}_N$, but it holds that $\tilde{p}_n = \bar{p}_n$ for all $n$; i.e., there is no measurement noise. Each $\breve{m}_n$, therefore, indicates which quantization interval contains $\bar{p}_n$. Upon denoting the endpoints of the interval that contains $\breve{m}_n$ as $a(\breve{m}_n)$ and $b(\breve{m}_n)$, it makes sense to now seek $\hat{p}_1, \ldots, \hat{p}_C$ that satisfy $[\hat{p}_1(x_n), \ldots, \hat{p}_C(x_n)]\, \boldsymbol{\zeta}_n \in [a(\breve{m}_n), b(\breve{m}_n)]$ for all $n$.

Finally, in the case where there is measurement noise, $\tilde{p}_n$ is generally different from $\bar{p}_n$. If the noise is small relative to the width of the quantization interval, the result of quantizing either value will often be the same, but not always. This means that one cannot impose that $[\hat{p}_1(x_n), \ldots, \hat{p}_C(x_n)]\, \boldsymbol{\zeta}_n$ necessarily falls in the quantization interval $[a(\breve{m}_n), b(\breve{m}_n)]$. Instead, the condition must be encouraged in a soft manner by penalizing deviations from the interval. Interestingly, by penalizing deviations in a linear fashion, it can be shown that the resulting estimates can be obtained through support vector regression [4].

The previous considerations can be extended to the case where the filter bank at each sensor contains $L > 1$ filters, as depicted in Figure 12. Observe that now two subscripts are necessary to index each branch. The power at the $l$th branch of the sensor at $x_n$ is given by $\bar{p}_{n,l} = [p_1(x_n), \ldots, p_C(x_n)]\, \boldsymbol{\zeta}_{n,l}$. Since all of the vectors $\boldsymbol{\zeta}_{n,1}, \ldots, \boldsymbol{\zeta}_{n,L}$ multiply the same $[p_1(x_n), \ldots, p_C(x_n)]$, the values $\bar{p}_{n,l}$ are not fully informative about $p_1, \ldots, p_C$ unless the vectors $\boldsymbol{\zeta}_{n,1}, \ldots, \boldsymbol{\zeta}_{n,L}$ are linearly independent. This imposes a design constraint on the filters. For example, filters with pseudorandom impulse responses may be utilized and are expected to yield linearly independent vectors $\boldsymbol{\zeta}_{n,1}, \ldots, \boldsymbol{\zeta}_{n,L}$ so long as $L \leq C$.
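The compression step at one sensor can be sketched as follows: $L$ pseudorandom filters yield the linear sketches $[p_1(x_n), \ldots, p_C(x_n)]\,\boldsymbol{\zeta}_{n,l}$, and each branch power is uniformly quantized before being reported. The filter lengths, quantizer resolution, and PSD basis below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
C, L, Nf = 4, 3, 256                      # basis functions, filters, FFT bins
f = np.linspace(0, 1, Nf)

# Known PSD basis (rectangular bands here) and the sensor's local map values.
Phi = np.stack([((f >= c / C) & (f < (c + 1) / C)).astype(float)
                for c in range(C)])
p_at_xn = np.array([0.8, 0.1, 0.4, 0.2])  # p_1(x_n), ..., p_C(x_n)

# Pseudorandom filters: zeta_{n,l}[c] = integral of phi_c * |Gamma_l|^2 df.
Gamma2 = np.abs(np.fft.fft(rng.standard_normal((L, 16)), n=Nf, axis=1)) ** 2
zeta = Phi @ Gamma2.T / Nf                # C x L matrix of sketch weights

p_branch = p_at_xn @ zeta                 # true branch powers, length L

def quantize(v, lo, hi, bits=4):
    """Uniform quantizer: report only the bin index (bits per branch)."""
    levels = 2 ** bits
    return np.clip(((v - lo) / (hi - lo) * levels).astype(int), 0, levels - 1)

m_breve = quantize(p_branch, 0.0, p_branch.max() * 1.5)
# The FC knows zeta and the quantizer grid; it can then constrain
# [p_1(x_n), ..., p_C(x_n)] @ zeta to lie in the reported intervals.
```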
Future directions

Although RME has been the subject of a sizable research body, a large number of open issues still remain. First of all, the potential of radio maps to endow applications with radio situational awareness is yet to be fully exploited. A large part of the progress in this regard has taken place in the context of device-free localization (see, e.g., [43]) and UAV communications (see, e.g., [17] and references therein), but a number of tasks arising in cellular networks, such as resource allocation, are yet to be explored. Radio maps can also be used as priors for enhanced channel estimation in mobile communications.

Improving inference biases in data-driven radio map estimators is also necessary. This can be achieved by collecting extensive datasets in multiple bands since most works so far rely on synthetic data generated with ray-tracing software. Such datasets would also open the door to devising improved uncertainty metrics for spectrum surveying; remarkably, these can be used for improving spectrum surveying techniques [22]. Furthermore, hybrid model-based and data-driven approaches have the potential to combine the best of both worlds [33]. The



… also be used as priors for enhanced channel estimation in mobile communications.
Improving inference biases in data-driven radio map estimators is also necessary. This can be achieved by collecting extensive datasets in multiple bands since most works so far rely on synthetic data generated with ray-tracing software. Such datasets would also open the door to devising improved uncertainty metrics for spectrum surveying. Remarkably, these can be used for improving spectrum surveying techniques [22]. Furthermore, hybrid model-based and data-driven approaches have the potential to combine the best of both worlds [33]. The rationale is that radio propagation models may significantly reduce the amount of data required to train data-driven estimators, whereas learning from data can significantly improve the accuracy of model-based approaches.
Methods for coping with various sources of error must also be devised. For example, time variations may be better predicted by exploiting side information on the mobility of terminals. In this context, trajectories of ground vehicles on the road or UAVs in aerial corridors may be instrumental to reduce the effective dimensionality of propagation maps. One can also model how groups of persons or vehicles move to better predict signal strength maps as a whole. Other sources of error to counter include the use of antennas with two polarizations and nonisotropic gain patterns.
Recent developments adopt machine learning algorithms to predict the channel state information (CSI) of desired multiantenna channels based on pilot CSI. This approach can capture the characteristics of small-scale fading, going beyond channel gain maps. In [46], the pilot CSIs are obtained from a set of links that are different from the target link. The optimal transmit beam pattern of the desired link is predicted based on the acquired CSIs. When the source and the target links are not colocated, the traditional assumption is that the CSIs are statistically independent. In reality, there can be significant dependency between the CSIs and the geometry of the propagation environment, transceiver locations, the line-of-sight path, and other multipaths within the coherence time of the channels. Given sufficiently rich pilot CSI measurements that capture the relevant geometry, an appropriate nonlinear mapping (e.g., via a DNN) can exploit this dependency. As a related idea, channel charting obtains, in an unsupervised fashion, low-dimensional embeddings of the high-dimensional CSIs that approximately provide the spatial locations of the measurements [48].
Finally, further types of radio maps may also be explored. For example, maps may be developed for massive multiple-input, multiple-output and millimeter-wave networks to benefit from reduced search time for beam selection. As another example, exploring delay-Doppler maps could be instrumental in the context of resource allocation for the emerging orthogonal time frequency and space modulation.

Related work
The interested reader can delve deeper into RME through the surveys [42] and [49]. In [49], the focus is on occupancy maps, which are radio maps that provide the fraction of time that a certain frequency channel is used. On the other hand, the authors of [42] focus on power map estimation and review other methods that are not discussed here due to space limitations. Relative to these works, the present tutorial is more introductory in nature and considers more classes of maps as well as more recent methods.

Conclusions
Radio maps characterize important metrics of the RF spectrum landscape across a geographical area. Two families of radio maps were considered based on whether the received signal strength or the propagation channel effects are of interest, and a large number of representative applications were discussed. Tutorial expositions of various data-driven methods for RME have been presented, ranging from parametric, nonparametric, and probabilistic approaches to recent powerful deep learning techniques, incorporating useful priors, such as sparsity, low rank, and union-of-subspace structures. Practical issues related to spectrum surveying, noisy location estimates, decentralized implementation, and limited-rate measurements were also discussed. With the advent of ultradense and ultradynamic deployment scenarios often envisioned in future wireless networking, the role of data-driven spectrum cartography enabled via sophisticated RME techniques will likely become even more relevant.

Acknowledgment
This research was funded, in part, by the Research Council of Norway under IKTPLUSS grant 311994.

Authors
Daniel Romero ([email protected]) received his M.Sc. and Ph.D. degrees in signal theory and communications from the University of Vigo, Spain, in 2011 and 2015, respectively. From 2015 to 2016, he was a postdoctoral researcher with the Digital Technology Center and Department of Electrical and Computer Engineering, University of Minnesota, USA. Then, he joined the Department of Information and Communication Technology, University of Agder, Grimstad 4879, Norway, as an associate professor. His research interests include machine learning, artificial intelligence, optimization, signal processing, and aerial communications.
Seung-Jun Kim ([email protected]) received his B.S. and M.S. degrees from Seoul National University in 1996 and 1998, respectively, and his Ph.D. degree from the University of California, Santa Barbara in 2005, all in electrical engineering. During 2005–2008, he was with NEC Labs America, Princeton, NJ. During 2008–2014, he was with the University of Minnesota, where his final title was research associate professor. He joined the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250 USA in 2014. He is a senior area editor for IEEE Signal Processing Letters. His research interests include signal processing, machine learning, and optimization with applications to wireless communication, power systems, and medical imaging.

References
[1] A. Alaya-Feki, S. B. Jemaa, B. Sayrac, P. Houze, and E. Moulines, "Informed spectrum usage in cognitive radio networks: Interference cartography," in Proc. IEEE Int. Symp. Personal, Indoor Mobile Radio Commun., Cannes, France, Sep. 2008, pp. 1–5, doi: 10.1109/PIMRC.2008.4699911.
[2] J.-A. Bazerque and G. B. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1847–1862, Mar. 2010, doi: 10.1109/TSP.2009.2038417.
[3] H. B. Yilmaz, T. Tugcu, F. Alagöz, and S. Bayhan, "Radio environment map as enabler for practical cognitive radio networks," IEEE Commun. Mag., vol. 51, no. 12, pp. 162–169, Dec. 2013, doi: 10.1109/MCOM.2013.6685772.
[4] D. Romero, S.-J. Kim, G. B. Giannakis, and R. López-Valcarce, "Learning power spectrum maps from quantized power measurements," IEEE Trans. Signal Process., vol. 65, no. 10, pp. 2547–2560, May 2017, doi: 10.1109/TSP.2017.2666775.
[5] S.-J. Kim, E. Dall'Anese, and G. B. Giannakis, "Cooperative spectrum sensing for cognitive radios using Kriged Kalman filtering," IEEE J. Sel. Topics Signal Process., vol. 5, no. 1, pp. 24–36, Feb. 2011, doi: 10.1109/JSTSP.2010.2053016.
[6] D. Romero, D. Lee, and G. B. Giannakis, "Blind radio tomography," IEEE Trans. Signal Process., vol. 66, no. 8, pp. 2055–2069, Jan. 2018, doi: 10.1109/TSP.2018.2799169.
[7] C. Phillips, D. Sicker, and D. Grunwald, "Bounding the practical error of path loss models," Int. J. Antennas Propag., vol. 2012, pp. 71–82, Jun. 2012, doi: 10.1155/2012/754158.
[8] E. Dall'Anese, J.-A. Bazerque, and G. B. Giannakis, "Group sparse lasso for cognitive network sensing robust to model uncertainties and outliers," Phys. Commun., vol. 5, no. 2, pp. 161–172, Jun. 2012, doi: 10.1016/j.phycom.2011.07.005.
[9] J.-A. Bazerque, G. Mateos, and G. B. Giannakis, "Group-lasso on splines for spectrum cartography," IEEE Trans. Signal Process., vol. 59, no. 10, pp. 4648–4663, Oct. 2011, doi: 10.1109/TSP.2011.2160858.
[10] C. Parera, Q. Liao, I. Malanchini, C. Tatino, A. E. C. Redondi, and M. Cesana, "Transfer learning for tilt-dependent radio map prediction," IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 2, pp. 829–843, Jan. 2020, doi: 10.1109/TCCN.2020.2964761.
[11] T. Imai, K. Kitao, and M. Inomata, "Radio propagation prediction model using convolutional neural networks by deep learning," in Proc. IEEE Eur. Conf. Antennas Propag., Krakow, Poland, Apr. 2019, pp. 1–5.
[12] M. Iwasaki, T. Nishio, M. Morikura, and K. Yamamoto, "Transfer learning-based received power prediction with ray-tracing simulation and small amount of measurement data," 2020, arXiv:2005.00833.
[13] R. Levie, Ç. Yapar, G. Kutyniok, and G. Caire, "RadioUNet: Fast radio map estimation with convolutional neural networks," IEEE Trans. Wireless Commun., vol. 20, no. 6, pp. 4001–4015, 2021, doi: 10.1109/TWC.2021.3054977.
[14] D. Romero and G. Leus, "Non-cooperative aerial base station placement via stochastic optimization," in Proc. IEEE Mobile Ad-Hoc Sensor Netw., Shenzhen, China, Dec. 2019, pp. 131–136.
[15] E. Axell, G. Leus, and E. G. Larsson, "Overview of spectrum sensing for cognitive radio," in Proc. 2nd Int. Workshop Cogn. Inf. Process., 2010, pp. 322–327, doi: 10.1109/CIP.2010.5604136.
[16] E. Dall'Anese, S.-J. Kim, and G. B. Giannakis, "Channel gain map tracking via distributed kriging," IEEE Trans. Veh. Technol., vol. 60, no. 3, pp. 1205–1211, Feb. 2011, doi: 10.1109/TVT.2011.2113195.
[17] D. Romero, P. Q. Viet, and G. Leus, "Aerial base station placement leveraging radio tomographic maps," May 2022, arXiv:2109.07372.
[18] Y. Teganya and D. Romero, "Deep completion autoencoders for radio map estimation," IEEE Trans. Wireless Commun., vol. 21, no. 3, pp. 1710–1724, Aug. 2021, doi: 10.1109/TWC.2021.3106154.
[19] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[20] D. Romero, S.-J. Kim, and G. B. Giannakis, "Stochastic semiparametric regression for spectrum cartography," in Proc. IEEE Int. Workshop Comput. Adv. Multi-Sensor Adapt. Process., Cancun, Mexico, Dec. 2015, pp. 513–516, doi: 10.1109/CAMSAP.2015.7383849.
[21] A. Agarwal and R. Gangopadhyay, "Predictive spectrum occupancy probability-based spatio-temporal dynamic channel allocation map for future cognitive wireless networks," Trans. Emerg. Telecommun. Technol., vol. 29, no. 8, p. e3442, Jun. 2018, doi: 10.1002/ett.3442.
[22] R. Shrestha, D. Romero, and S. P. Chepuri, "Spectrum surveying: Active radio map estimation with autonomous UAVs," IEEE Trans. Wireless Commun., early access, Aug. 2022, doi: 10.1109/TWC.2022.3197087.
[23] M. Gudmundson, "Correlation model for shadow fading in mobile radio systems," Electron. Lett., vol. 27, no. 23, pp. 2145–2146, Nov. 1991, doi: 10.1049/el:19911328.
[24] D.-H. Huang, S.-H. Wu, W.-R. Wu, and P.-H. Wang, "Cooperative radio source positioning and power map reconstruction: A sparse Bayesian learning approach," IEEE Trans. Veh. Technol., vol. 64, no. 6, pp. 2318–2332, Jun. 2015, doi: 10.1109/TVT.2014.2345738.
[25] B. Khalfi, B. Hamdaoui, and M. Guizani, "AirMAP: Scalable spectrum occupancy recovery using local low-rank matrix approximation," in Proc. IEEE GLOBECOM, Abu Dhabi, UAE, Dec. 2018, pp. 206–212, doi: 10.1109/GLOCOM.2018.8647667.
[26] D. Schäufele, R. L. G. Cavalcante, and S. Stanczak, "Tensor completion for radio map reconstruction using low rank and smoothness," in Proc. IEEE SPAWC, Cannes, France, Jul. 2019, pp. 1–5, doi: 10.1109/SPAWC.2019.8815495.
[27] G. Zhang, X. Fu, J. Wang, and M. Hong, "Coupled block-term tensor decomposition based blind spectrum cartography," in Proc. Asilomar Conf. Signal, Syst., Comput., Pacific Grove, CA, USA, Nov. 2019, pp. 1644–1648, doi: 10.1109/IEEECONF44664.2019.9048667.
[28] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006, doi: 10.1109/TSP.2006.881199.
[29] S.-J. Kim and G. B. Giannakis, "Cognitive radio spectrum prediction using dictionary learning," in Proc. IEEE Global Commun. Conf., Atlanta, GA, USA, Dec. 2013, pp. 3206–3211, doi: 10.1109/GLOCOM.2013.6831565.
[30] S.-J. Kim and G. B. Giannakis, "Dynamic learning for cognitive radio sensing," in Proc. 5th IEEE Int. Workshop Comp. Adv. Multi-Sensor Adapt. Process., St. Martin, French Caribbean, Dec. 2013, pp. 388–391, doi: 10.1109/CAMSAP.2013.6714089.
[31] V. V. Ratnam et al., "FadeNet: Deep learning-based mm-wave large-scale channel fading prediction and its applications," IEEE Access, vol. 9, pp. 3278–3290, 2021, doi: 10.1109/ACCESS.2020.3048583.
[32] E. Krijestorac, S. Hanna, and D. Cabric, "Spatial signal strength prediction using 3D maps and deep learning," in Proc. IEEE Int. Conf. Commun., 2021, pp. 1–6, doi: 10.1109/ICC42927.2021.9500970.
[33] J. Thrane, D. Zibar, and H. L. Christiansen, "Model-aided deep learning method for path loss prediction in mobile communication systems at 2.6 GHz," IEEE Access, vol. 8, pp. 7925–7936, Jan. 2020, doi: 10.1109/ACCESS.2020.2964103.
[34] Q. Niu, Y. Nie, S. He, N. Liu, and X. Luo, "RecNet: A convolutional network for efficient radiomap reconstruction," in Proc. IEEE Int. Conf. Commun., Kansas City, MO, USA, May 2018, pp. 1–7, doi: 10.1109/ICC.2018.8422971.
[35] X. Han, L. Xue, F. Shao, and Y. Xu, "A power spectrum maps estimation algorithm based on generative adversarial networks for underlay cognitive radio networks," Sensors, vol. 20, no. 1, p. 311, Jan. 2020, doi: 10.3390/s20010311.
[36] S. Shrestha, X. Fu, and M. Hong, "Deep generative model learning for blind spectrum cartography with NMF-based radio map disaggregation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 4920–4924, doi: 10.1109/ICASSP39728.2021.9413382.
[37] D. Romero and G. Leus, "Wideband spectrum sensing from compressed measurements using spectral prior information," IEEE Trans. Signal Process., vol. 61, no. 24, pp. 6232–6246, Dec. 2013, doi: 10.1109/TSP.2013.2283473.
[38] K. V. Mardia, C. Goodall, E. J. Redfern, and F. J. Alonso, "The Kriged Kalman filter," Test, vol. 7, no. 2, pp. 217–285, Dec. 1998, doi: 10.1007/BF02565111.
[39] D. Lee, S.-J. Kim, and G. B. Giannakis, "Channel gain cartography for cognitive radios leveraging low rank and sparsity," IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5953–5966, Jun. 2017, doi: 10.1109/TWC.2017.2717822.
[40] N. Patwari and P. Agrawal, "NeSh: A joint shadowing model for links in a multi-hop network," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Las Vegas, NV, USA, Mar. 2008, pp. 2873–2876, doi: 10.1109/ICASSP.2008.4518249.
[41] B. R. Hamilton, X. Ma, R. J. Baxley, and S. M. Matechik, "Propagation modeling for radio frequency tomography in wireless networks," IEEE J. Sel. Topics Signal Process., vol. 8, no. 1, pp. 55–65, Feb. 2014, doi: 10.1109/JSTSP.2013.2287471.
[42] M. Pesko, T. Javornik, A. Kosir, M. Stular, and M. Mohorcic, "Radio environment maps: The survey of construction methods," KSII Trans. Internet Inf. Syst., vol. 8, no. 11, pp. 3789–3809, Dec. 2014, doi: 10.3837/tiis.2014.11.008.
[43] J. Wilson, N. Patwari, and O. G. Vasquez, "Regularization methods for radio tomographic imaging," in Proc. Virginia Tech Symp. Wireless Personal Commun., Blacksburg, VA, USA, Jun. 2009, pp. 1–9.
[44] D. Lee, D. Berberidis, and G. B. Giannakis, "Adaptive Bayesian channel gain cartography," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Calgary, Canada, Apr. 2018, pp. 3555–3558, doi: 10.1109/ICASSP.2018.8461412.
[45] Y. Teganya, D. Romero, L. M. Lopez-Ramos, and B. Beferull-Lozano, "Location-free spectrum cartography," IEEE Trans. Signal Process., vol. 67, no. 15, pp. 4013–4026, Aug. 2019, doi: 10.1109/TSP.2019.2923151.
[46] Z. Jiang, S. Chen, A. F. Molisch, R. Vannithamby, S. Zhou, and Z. Niu, "Exploiting wireless channel state information structures beyond linear correlations: A deep learning approach," IEEE Commun. Mag., vol. 57, no. 3, pp. 28–34, Mar. 2019, doi: 10.1109/MCOM.2019.1800581.
[47] S.-J. Kim, N. Jain, G. B. Giannakis, and P. Forero, "Joint link learning and cognitive radio sensing," in Proc. Asilomar Conf. Signal, Syst., Comput., Pacific Grove, CA, USA, Nov. 2011, pp. 1415–1419, doi: 10.1109/ACSSC.2011.6190250.
[48] J. Deng, O. Tirkkonen, J. Zhang, X. Jiao, and C. Studer, "Network-side localization via semi-supervised multi-point channel charting," in Proc. Int. Wireless Commun. Mobile Comput. Conf., Harbin, China, Jul. 2021, pp. 1654–1660, doi: 10.1109/IWCMC51323.2021.9498723.
[49] M. Höyhtyä et al., "Spectrum occupancy measurements: A survey and use of interference maps," IEEE Commun. Surveys Tuts., vol. 18, no. 4, pp. 2386–2414, Apr. 2016, doi: 10.1109/COMST.2016.2559525.

SP


SP FORUM
Luigi Longobardi, Tony VenGraitis, and Christian Jutten

Scientific Integrity and Misconduct in Publications

Guidance from the IEEE Publishing Ethics Team

Ethics and scientific integrity are key concepts in scientific research. Ethics in science concerns reflections on what is morally good or bad and on the values that motivate our actions and their consequences. It appeals to our sense of morality and responsibility toward society, humans, and, more generally, living beings. Scientific integrity concerns the right way of conducting research practices. It is mandatory for society to enhance confidence in scientists and in the sciences. It is also very important when scientists serve as mentors for young researchers and, thus, are responsible for teaching good practices and setting examples of integrity for their students and collaborators.
These two concepts, ethics and scientific integrity, are sometimes confused, although they are very dissimilar. In short, we can claim that ethics is related to philosophical reflections capable of opening discussions, whereas scientific integrity is related to a set of good practices that must be rigorously and always applied, without discussion.
As examples, we have the following.
■ The relevance of experiments on living beings and that of implementing invasive brain–computer interfaces for augmenting the capacities of well-being for people are considered to be ethical issues.
■ During the COVID-19 pandemic, many results based on bad practices, e.g., rejection of patients with risks, voluntary wrong interpretation of results, and so on, have been published in journals and largely distributed in social networks: these are examples of data falsification, a typical and serious misconduct in terms of scientific integrity.
IEEE Societies, particularly the IEEE Signal Processing Society, are strongly involved in the respect of scientific integrity. Details about the main rules and good practices for authors as well as reviewers, associate editors, and editors-in-chief (EICs) are detailed in the IEEE Publication Services and Products Board (PSPB) Operations Manual, which is available at https://pspb.ieee.org/images/files/files/opsmanual.pdf.
That 138-page-long manual is very complete. Nevertheless, most past and current IEEE Members have never read it. In this forum article, we propose to establish a discussion about different facets of scientific and publishing misconduct to discover rules of scientific integrity and what you can do when you face such misconduct. For this purpose, IEEE Signal Processing Magazine (SPM) EIC Christian Jutten interviewed Luigi Longobardi, director of Publishing Ethics and Conduct at IEEE, and Tony VenGraitis, program manager of Publication Ethics, IEEE Publications, and they answered some questions. The answers have been elaborated collectively by these two persons of the IEEE Publishing Ethics Team (PET).
SPM: Let us begin the discussion with the topic of well-known and serious scientific misconduct, which includes data fabrication, data falsification, and plagiarism. Can you comment on these three areas of misconduct?
PET: Misconduct takes many forms and may involve the actions of authors, editors, reviewers, and publishers. As you pointed out, the U.S. Office of Research Integrity defines research misconduct as the "fabrication, falsification, or plagiarism in proposing, performing, or reviewing research, or in reporting research results." All of these constitute conduct or behavior that violates or compromises the ethical standards and expectations determined by the scientific publishing community. It is important that all parties are aware of what is expected of them and are educated and informed regarding industry best practices, guidelines, and codes of conduct.
When authors submit articles to IEEE, they are expected to comply with IEEE policy regarding authorship responsibilities. Among these responsibilities are the needs to ensure that data are accurate and free from inappropriate influence and that any republished content has been properly referenced and cited. While it is not always possible for editors and peer reviewers to evaluate the validity of data, it is still important to examine the article's premise and

conclusions for any signs of suspicious information. Some signs include unusually slanted findings or data trends, data that are too perfect or too "noisy," or situations in which authors are unwilling or unable to provide the raw data.
SPM: Can you also comment on what self-plagiarism is, and in what sense it is misconduct?
PET: When writing a new article, scientists are generally expected to make substantive contributions that are distinct from their earlier ones. However, the close relationship among articles often requires authors to repeat some content. For some time, this practice has inaccurately been called self-plagiarism, while it should, in effect, be considered as text recycling. Common examples include descriptions of methods, materials, or statistical tests; background information; and discussions of prior relevant research.
Text recycling can be defined as the reuse of material, including text, visuals, or equations, in a new document, where
1) The material in the new document is identical, or basically equivalent in form and content, to that of the source.
2) The material is not presented in the new document as a quotation.
3) At least one author of the new document is also an author of the prior document.
Under this definition, text recycling can be appropriate or inappropriate, depending on the details of each case. As authors develop their papers, they will normally refer to their previously published articles as part of the evolutionary process of the work. For example, a conference article will often be further developed and expanded into a journal article. This is an acceptable practice, and IEEE encourages this development of an author's work. However, authors are expected to fully reference any prior publications when submitting new work. Authors are also required to inform EICs of the prior work and how the new submission differs from the earlier article. It would be unacceptable for authors to reuse the same text without any reference to the past work, characterizing a multiple publication, or to submit the same manuscript simultaneously to more than one publication, characterizing a multiple submission. In both cases, authors are given a warning if it is their first offense. But repeated incidents could result in more severe corrective actions, such as a one-year ban from publishing.
SPM: What are the tools that IEEE has been using for detecting such misconduct?
PET: IEEE uses Similarity Check, which detects similarities between a submitted manuscript and previously published content. The tool compares against a database of participating STM publishers as well as online resources. Similarity Check will identify only the similar text, not any images or equations set as figures. It is necessary for the editors to review reports that trigger alerts of high similarity levels because the detected text may be properly reused by the authors. It is also important for reviewers and editors to be alert for signs of image or data falsification. These may not be readily noticeable, but if there is something about the images or data that seems unusual, it is worth bringing it to the attention of the EIC. In many cases, we will find out about content issues after publication, either from a whistleblower or from authors who see their copied content, and then an investigation will be required with subject matter experts who can evaluate the allegations.
SPM: What are the five levels of plagiarism defined in IEEE and their corrective actions?
PET: To help in creating consistency while addressing plagiarism, IEEE has defined multiple "levels" based on the severity of the misconduct.
■ Level One: Uncredited verbatim copying in more than half of the submitted paper
■ Level Two: Uncredited verbatim copying of 20%–50% in a submitted paper
■ Level Three: Uncredited verbatim copying of individual elements, e.g., paragraphs, sentences, or figures
■ Level Four: Uncredited improper paraphrasing of pages or paragraphs
■ Level Five: Credited verbatim copying of a major portion of a paper without clear delineation, e.g., quotes or indents.
Along with each of the plagiarism levels, there are correlating corrective actions. These can range from posting a Notice of Violation in our online publication database, to the prohibition of publication, to submitting a letter of apology to the author of the original work.
SPM: In the "publish or perish" trends, another misconduct concerns authorship. Although usually less serious than data fabrication, data falsification, and plagiarism, authorship misconduct is unacceptable. Can you explain the rules for being a coauthor of a paper?
PET: According to the IEEE PSPB Operations Manual, authorship credit must be reserved for individuals who have met each of the following conditions:
1) They have made a significant intellectual contribution to the theoretical development, system or experimental design, prototype development, and/or analysis and interpretation of data associated with the work contained in the article.
2) They have contributed to drafting the article or reviewing and/or revising it for intellectual content.
3) They have approved the final version of the article as accepted for publication, including references.
SPM: What are the other typical examples of author misconduct?
PET: If one excludes data fabrication or falsification and plagiarism, examples of author misconduct include
■ providing fake suggested reviewers
■ undisclosed conflict of interest in work described within an article
■ gift/ghost authorship, i.e., author credit given or sold to an individual without any technical contributions to the article
■ multiple submission, i.e., submitting an article to more than one publication simultaneously
■ multiple publication, i.e., resubmitting a published article as a new manuscript without disclosing it to the editor.
SPM: Publication processes involve other actors, namely associate editors,
who manage the reviewing processes; reviewers, who analyze papers and make recommendations for rejection, revision, or acceptance; and EICs, who are responsible for the journal publication. Can you explain the possible types of misconduct by reviewers, associate editors, and EICs?
PET: The following are examples of editorial misconduct for reviewers and editors, with the relevant section location of policy in the IEEE PSPB Operations Manual:
■ Citation stacking: Requiring authors to add unrelated publication references to their manuscripts for the benefit of the volunteer or a relevant publication (8.2.2.A.4)
■ Gift authorship: Requiring authors to add unrelated coauthor names to their manuscripts for the benefit of the volunteer or another author (8.2.1.A.1)
■ Conflict of interest: Accepting editorial responsibilities despite having a clear and direct relationship with the author or manuscript and then providing an inappropriate influence on the decision-making process (8.2.2.A.2)
■ Breach of confidentiality: Using or disclosing information from a submitted manuscript not yet released to the public (8.2.2.B.1/8.2.2.C.4)
■ Neglecting editorial responsibilities: Accepting or rejecting manuscripts without completing the standard peer-review process (8.2.1.C.2).
Other examples of reviewer impropriety include
■ misrepresenting facts in a review and unfairly criticizing a competitor's work
■ proposing changes that appear to merely support the reviewer's own work or hypotheses
■ unreasonably delaying the review process
■ using ideas or text from a manuscript under review
■ including personal or ad hominem criticism of the authors.
SPM: How does IEEE detect these types of misconduct and what are their punishments?
PET: IEEE is exploring automated ways of detecting editorial misconduct, but for now the best way to detect these incidents is through careful observation by all publishing volunteers involved in the editorial process. If something does not look right, or if there is suspicious behavior by an editor, guest editor, or reviewer, it should be reported to the PET. Supporting data and communications can be sent to help the PET in identifying whether there is misconduct and in bringing the matter to the appropriate publication officers for further review.
SPM: Most of the readers are probably not aware of what they can do in the face of such misconduct. What should authors do if they discover their work has been plagiarized? What should authors do when they discover anomalies in authorship? What should authors do when detecting misconduct by a reviewer, an associate editor, or an EIC?
PET: In all three cases, authors or editors should contact the PET by way of the Ethics Reporting website at www.ieee-ethics-reporting.org/, or they can contact the team directly by e-mail at [email protected] if they have questions or need immediate advice.
All matters of publication misconduct must be handled fairly and confidentially, and individuals who are suspected of misconduct must be given an opportunity to respond to any allegations before any actions can be taken. IEEE policy requires misconduct cases to be reviewed by an ad hoc committee of subject matter experts appointed by the editor or the sponsoring Society/Section senior publication officers to determine whether there is misconduct, and if so, what type of corrective actions are needed. Examples of corrective actions that may be applicable would include
■ a ban from publishing in all IEEE publications for an appropriate period
■ restriction from editorial duties
■ retraction of articles from IEEE Xplore
■ adding a Notice of Violations to articles in IEEE Xplore
■ assigning a training course on avoiding publishing misconduct.
The PET can provide useful checklists and report templates to guide EICs with their investigations.
SPM: Has IEEE designed some MOOCs or education tools for authors, reviewers, editors, and so on, for explaining and avoiding misconduct?
PET: The PET is collaborating with the Author Engagement staff to create a collection of editor training materials that will cover many of the topics we have discussed in this article. This goal was initiated following PSPB approval of new policy requiring all incoming EICs to receive training on publishing ethical misconduct. We expect to have the first portion of this editor training available by the end of 2022. For authors who may want more information on a variety of topics related to appropriate authorship behavior, there is an Author Center that is a valuable resource of all things related to publishing with IEEE: https://ieeeauthorcenter.ieee.org/.

Authors
Luigi Longobardi ([email protected]) received his undergraduate degree in physics from the Università degli Studi di Napoli "Federico II" and his M.A. and Ph.D. degrees from Stony Brook University. He was a Marie Curie researcher in Italy and a postdoctoral researcher at Dartmouth College. His research focused on quantum information technologies and engineering using superconducting devices. He transitioned into publishing in 2012 at the American Physical Society, working on the Physical Review journals as associate editor, then journal manager, and finally assistant editorial director. He then became executive editor at AIP Publishing (AIPP) in Melville, NY, USA. He has built expertise on peer review, journal development, portfolio expansion, mentoring and training of editors, publishing ethics, and research misconduct. Since January 2022, he has been director of publication ethics and conduct for IEEE.
Tony VenGraitis ([email protected]) received his B.A. degrees in English and education from Seton Hall University. He now serves as the program manager for the Publication Ethics Team at IEEE, Piscataway, NJ USA 08854. Prior to joining IEEE, he worked (continued on page 84)
TIPS & TRICKS
Kai Wu, J. Andrew Zhang, and Y. Jay Guo

Fast and Accurate Linear Fitting for an Incompletely Sampled Gaussian Function With a Long Tail

Fitting experiment data onto a curve is a common signal processing technique to extract data features and establish the relationship between variables. Often, we expect the curve to comply with some analytical function and then turn data fitting into estimating the unknown parameters of a function. Among analytical functions for data fitting, the Gaussian function is the most widely used one due to its extensive applications in numerous science and engineering fields. To name just a few, the Gaussian function is highly popular in statistical signal processing and analysis, thanks to the central limit theorem [1], and the Gaussian function frequently appears in the quantum harmonic oscillator, quantum field theory, optics, lasers, and many other theories and models in physics [2]; moreover, the Gaussian function is widely applied in chemistry for depicting molecular orbitals, in computer science for imaging processing, and in artificial intelligence for defining neural networks.
Fitting a Gaussian function, or, simply, Gaussian fitting, is consistently of high interest to the signal processing community [3]–[6]. Since the Gaussian function is underlain by an exponential function, it is nonlinear and not easy to be fitted directly. One effective way of counteracting its exponential nature is to apply the natural logarithm, which has been applied in transferring the Gaussian fitting into a linear fitting [4]. However, the problem of the logarithmic transformation is that it makes the noise power vary over data samples, which can result in biased Gaussian fitting. The weighted least square (WLS) fitting is known to be effective in handling uneven noise backgrounds [7]. However, as unveiled in [5], the ideal weighting for linear Gaussian fitting is directly related to the unknown Gaussian function. To this end, an iterative WLS is developed in [5], starting with using the data samples (which are noisy values of a Gaussian function) as weights and then iteratively reconstructing the weights using the previously estimated function parameters.
For the iterative WLS, the number of iterations required for a satisfactory fitting performance can be large, particularly when an incompletely sampled Gaussian function with a long tail is given [see Figure 1(a) for such a case]. Establishing a good initialization is a common strategy for improving the convergence speed and performance of an iterative algorithm. Noticing the unavailability of a proper initialization for the iterative WLS, we aim to fill the blank by developing a high-quality one in this article. To do so, we introduce a few signal processing tricks to develop high-performance initial estimators for the three parameters of a Gaussian function. When our initial fitting results are applied, not only is the efficiency of the iterative WLS substantially improved, but its accuracy is also greatly enhanced, particularly for noisy and incompletely sampled Gaussian functions with long tails. These will be demonstrated by simulation results.

Prior art and motivation
Let us start by elaborating on the signal model. A Gaussian function can be written as

f(x) = A e^{−(x − μ)²/(2σ²)},  (1)

where x is the function variable, and A, μ, and σ are the parameters to be estimated. They represent the height, location, and width of the function, respectively. Directly fitting f(x) can be cumbersome due to the exponential function. A well-known opponent of the exponential is the natural logarithm. Indeed, by taking the natural logarithm of both sides of (1), we can obtain the following polynomial after some basic rearrangements:

ln(f(x)) = a + bx + cx²,  (2)

where the coefficients a, b, and c are related to the Gaussian function parameters μ, σ, and A. Based on (1) and (2), it is easy to obtain

μ = −b/(2c);  σ = √(−1/(2c));  A = e^{a − b²/(4c)}.  (3)
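For reference, the coefficient-to-parameter conversion in (3) takes only a few lines. The Python sketch below is our own hedged illustration (the article's companion code is in MATLAB), valid for a meaningful fit with c < 0:

```python
import numpy as np

def poly_to_gaussian(a, b, c):
    """Map the polynomial coefficients of ln f(x) = a + b*x + c*x**2
    back to the Gaussian parameters (A, mu, sigma) via (3)."""
    mu = -b / (2 * c)                  # peak location
    sigma = np.sqrt(-1 / (2 * c))      # width; requires c < 0
    A = np.exp(a - b**2 / (4 * c))     # height
    return A, mu, sigma
```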

We see that the estimations of μ, σ, and A can be done through estimating a, b, and c. Since a, b, and c are coefficients of a polynomial, they can be readily estimated by employing linear fitting based on, e.g., the least square (LS) criterion [7].
In modern signal processing, we generally deal with noisy digital signals. Thus, instead of f(x) given in (1), the following signal is more likely to be dealt with:

y[n] = f[n] + ξ[n],  s.t.  f[n] = f(nδ_x),  n = 0, …, N − 1,  (4)

where n is the sample index, δ_x is the sampling interval of x, and ξ[n] is an additive white Gaussian noise. If we take the natural logarithm of y[n], we then have

ln(y[n]) = ln(f[n] + ξ[n]) = ln( f[n] (1 + ξ[n]/f[n]) )
         ≈ ln(f[n]) + ξ[n]/f[n]
         = a + b δ_x n + c δ_x² n² + ξ[n]/f[n],  (5)

where the first-order Taylor series ln(1 + x) ≈ x is applied to get the approximation, and ln(f[n]) is written into a polynomial form based on (2). Next, we review several linear fitting methods through which the motivation of this work will be highlighted.
For the sake of conciseness, we employ vector/matrix forms for the sequential illustrations. In particular, let us define the following three vectors:

y = [ln(y[0]), ln(y[1]), …, ln(y[N − 1])]^T,
x = [0, δ_x, …, (N − 1)δ_x]^T,
θ = [a, b, c]^T.  (6)

Then, based on (5), the linear Gaussian fitting problem can be conveniently written as

y = Xθ + ξ,  s.t.  X = [1, x, x ⊙ x],  (7)

where ξ denotes a column vector stacking the noise terms ξ[n]/f[n] (n = 0, 1, …, N − 1) in (5), and ⊙ denotes the pointwise product. Due to the use of these vector/matrix forms, the estimators reviewed in the following sections look different from their descriptions in the original work. However, regardless of the forms, they are the same in essence.

[Figure 1: four panels plotting f(x) versus x over [0, 10]; panels (a) and (c) annotate the "Tail Region" and "Principal Region," respectively.]
FIGURE 1. (a) The noisy samples of the Gaussian function with A = 1, μ = 9, and σ = 1.3 are plotted, where the dashed curve is the function without noise. (b) The iterative WLS, as illustrated in (9), is run 10 times, each time with 12 iterations and independently generated noise, leading to the fitting results shown. (c) and (d) Other than resetting μ = 6, the same results are plotted as in (a) and (b), respectively. As shown, the range of x is [0, 10], where the sampling interval is set as δ_x = 0.01.

LS fitting
The first fitting method [4] reviewed here applies LS on (7) to estimate the three unknown coefficients in θ. The solution is classical and can be written as [7]

θ̂ = X^@ y = (X^T X)^{−1} X^T y,  (8)

where X^@ denotes the pseudoinverse of X.
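As a quick illustration of (6)–(8), the following Python sketch (our own hedged example, not the authors' MATLAB implementation) performs the log-domain LS fit and maps the result back through (3):

```python
import numpy as np

def gaussian_ls_fit(y_samples, delta_x):
    """Linear LS Gaussian fitting in the log domain, following (6)-(8).
    Assumes all samples are strictly positive so the logarithm exists."""
    N = len(y_samples)
    x = np.arange(N) * delta_x
    y_log = np.log(y_samples)                     # vector y in (6)
    X = np.column_stack((np.ones(N), x, x**2))    # X = [1, x, x (.) x] in (7)
    theta, *_ = np.linalg.lstsq(X, y_log, rcond=None)  # LS solution (8)
    a, b, c = theta
    # Convert back to (A, mu, sigma) via (3)
    return np.exp(a - b**2 / (4 * c)), -b / (2 * c), np.sqrt(-1 / (2 * c))
```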

The LS fitting is simple but not without problems. As can be seen from (5), the additive white Gaussian noise ξ[n] is divided by f[n]. The division can severely increase the noise power at ns with f[n] ≈ 0, causing the noise enhancement problem. As a consequence of the problem, LS can suffer from poor fitting performance, particularly when the majority of samples are from the tail region of a Gaussian function, such as the one plotted in Figure 1(a).

Iterative WLS fitting
To solve the noise enhancement issue, the authors of [5] propose replacing LS with WLS. In contrast to LS, which treats each sample equally, WLS applies different weights over samples. The purpose of weighting is to counterbalance noise variations. For this purpose, f[n] is the ideal weight, as seen from (5). However, f[n] is the digital sample of an unknown Gaussian function.
To solve the problem, an iterative WLS is developed in [5], where y[n] (the noisy version of f[n]) is used as the initial weight. From the second iteration, the weight is constructed using the previous estimates of a, b, and c based on the relation depicted in (2). Let w_i denote the weighting vector at the ith iteration, collecting the weights over n = 0, 1, …, N − 1. Then, the iterative WLS can be executed as

θ̂_i = (X_i^T X_i)^{−1} X_i^T y_i,  i = 0, 1, …,
s.t.  y_i = w_i ⊙ y;  X_i = [w_i, w_i ⊙ x, w_i ⊙ x ⊙ x];
      w_i = e^y if i = 0; otherwise, w_i = e^{X θ̂_{i−1}},  (9)

where x and y are given in (6) and X in (7). The exponential is calculated pointwise in the last row.

Provided a sufficiently large number of iterations, iterative WLS can achieve a high-performance Gaussian fitting, generally better than LS. However, the more iterations, the more time-consuming iterative WLS would be. Moreover, the required number of iterations to achieve the same fitting performance changes with the proportion of the tail region.
Take the Gaussian function in Figure 1(a) for an illustration. The tail region is about half the whole sampled region. Perform 12 iterations based on (9) for 10 trials, each adding independently generated noise onto the same Gaussian function with A = 1, μ = 9, and σ = 1.3. The fitting results are given in Figure 1(b). We see that some fitting results still substantially differ from the true function, even after 12 iterations. In contrast, perform the same fitting as described but change μ to six (equivalently reducing the tail region). The fitting results of 10 independent trials are plotted in Figure 1(d). Obviously, the results look much better than those in Figure 1(b).
A common solution to reducing the number of iterations required by an iterative algorithm is a good initialization. In our case, the quality of the initial weight vector, i.e., w_0 in (9), can affect the overall number of iterations required by the iterative WLS to converge. Moreover, if the way of initializing w_0 can be immune to the proportion of the tail region, the convergence performance of the iterative WLS can then be less dependent on the proportion of the tail region. Our work is mainly aimed at designing a way of initializing the weight vector of the iterative WLS so as to reduce the number of overall iterations and relieve the dependence of fitting performance on the proportion of the tail region.

An interesting but not-good-enough initialization
An initialization for iterative WLS-based linear Gaussian fitting is developed in [6], which was originally motivated by separately fitting the parameters of a Gaussian function. In particular, exploiting the following relation

∫_{−∞}^{∞} f(x) dx = A √(2π) σ,  (10)

the work proposes a simple estimation of σ, as given by

σ̂ = (1/(√(2π) Â)) Σ_{n=0}^{N−1} y[n] δ_x,  (11)

where y[n] is the noisy sample of the Gaussian function to be estimated, δ_x is the sampling interval of x, and the summation approximates the integral of f(x) described earlier. Moreover, Â is estimated by

[Â, n̂] = max_n y[n].  (12)

Similar to how max(·) works in MATLAB [8], n̂ is the index where the maximization is achieved. According to the middle relation in (3), c can be estimated as ĉ = −1/(2σ̂²). Thus, the work in [6] removes c from the parameter vector θ given in (6) and employs an iterative WLS, similar to (9) but with reduced dimension, to estimate a and b.
A major error source of σ̂ obtained in (11) is the approximation error from using the summation in (11) to approximate the integral given in (10). Even if δ_x is fine enough, the approximation can still be problematic, depending on the proportion of the tail region in the sampled Gaussian function. For example, if the sampled function has a shape similar to the one given in Figure 1(c), we know that the summation can well approximate the integral given a fine δ_x. However, if the sampled function has a shape like the curve in Figure 1(a), the approximation error will be large regardless of δ_x. A condition is given in [6] stating when the summation in (11) can well approximate the integral in (10). Nevertheless, the question "What shall we do when the condition is not satisfied?" (which can be inevitable in practice) is yet to be answered.

Proposed Gaussian fitting
Looking at Figure 1(a), we know the summation in (11) cannot approximate the integral in (10). Now, instead of using the summation to approximate something unachievable, how about looking into a different question: "What can be approximated using the summation given in (11)?" We answer this question by performing the following computations (which look complex but are easily understandable):

Σ_{n=0}^{N−1} y[n] δ_x ≈(a) Σ_{n=0}^{N−1} δ_x f[n] ≈(b) ∫_0^{Nδ_x} f(x) dx
  =(c) ∫_0^{μ} f(x) dx + ∫_{μ}^{Nδ_x} f(x) dx
  =(d) (1/2) ∫_0^{2μ} f(x) dx + (1/2) ∫_{2μ−Nδ_x}^{Nδ_x} f(x) dx
  =(e) (√(2π) A σ / 2) erf( μ/(σ√2) ) + (√(2π) A σ / 2) erf( (Nδ_x − μ)/(σ√2) ),  (13)

where erf(·) is the so-called error function. It can be defined as [1]

erf(z) = (2/√π) ∫_0^z e^{−t²} dt.  (14)

How each step in (13) is obtained is detailed as follows:
■ ≈(a): This step replaces y[n] with f[n] by omitting the noise term ξ[n], as given in (4).
■ ≈(b): This is how the integral is often introduced in a math textbook. While the left side of ≈(b) approximately calculates the area below f(x), the right side does so exactly.
■ =(c): The integrating interval is split into contiguous halves.
■ =(d): The integrating interval of each integral in =(c) is doubled in such a way that it becomes symmetric against x = μ. Since f(x) is also symmetric against x = μ [see (1)], the scaling coefficient 1/2 can counterbalance the extension of the integrating interval.
■ =(e): It is based on a known fact [1]:

∫_{μ−ε}^{μ+ε} f(x) dx = √(2π) A σ erf( ε/(σ√2) ).

Similar to =(c) in (13), we can also split the summation Σ_{n=0}^{N−1} y[n] δ_x. Doing so, the two integrals on the right-hand side of =(c) in (13) can be, respectively, approximated by

S_b ≜ Σ_{n=0}^{n̂−1} y[n] δ_x  and  S_a ≜ Σ_{n=n̂}^{N−1} y[n] δ_x,  (15)

where n̂ is obtained in (12). (Note that n̂δ_x is an estimate of μ.) Moreover, tracking the computations in (13), we can easily attain

S_b ≈ (√(2π) Â σ / 2) erf( n̂δ_x/(σ√2) );
S_a ≈ (√(2π) Â σ / 2) erf( (Nδ_x − n̂δ_x)/(σ√2) ),  (16)

where A and μ have been replaced by their estimates given in (12). The two equations in (16) provide possibilities for estimating σ, which is the only unknown left. However, due to the presence of the nonelementary function erf(·), analytically solving σ from the equations is nontrivial. Moreover, we have two equations but one unknown. How to constructively exploit the information provided by both equations is also a critical problem. Here, we first develop an efficient method to estimate σ from either equation in (16), resulting in two estimates of σ; we then derive an asymptotically optimal combination of the two estimates.

Efficient estimation of σ
The two equations in (16) have the same structure. Therefore, let us focus on the top one for now. While solving σ analytically is difficult, numerical means can be resorted to. As is commonly done, we can select a large region of σ, discretize the region into fine grids, evaluate the values of the right-hand side in (16) on the grids, and identify the grid that leads to the closest result to S_b.
These steps are regular but not practically efficient. This is because evaluating the right-hand side of the equation in (16) needs the calculation of erf(·) for each σ grid. From (14), we see that erf(·) itself is an integral result. If we calculate erf(·) onboard, it would be approximated by a summation over sufficiently fine grids of the integrating variable. This can be highly time-consuming, particularly given that erf(·) needs to be calculated for each entry in a large set of σ grids. Alternatively, we may choose to store a lookup table of erf(·) onboard. This is doable but can also be troublesome, for the reason that the parameter of erf(·), as dependent on the parameters of the Gaussian function to be fitted, can span over a large range in different applications. The trouble, however, can be relieved through a simple variable substitution.
Making the substitution n̂δ_x = kσ in (16), we obtain

S_b ≈ (√(2π) Â n̂δ_x / (2k)) erf( k/√2 ).  (17)

Clearly, the dependence of the erf(·) function on μ (represented by n̂δ_x) and σ, as shown in (16), is now removed, making the erf(·) function solely related to the coefficient k. Therefore, a significance of the variable substitution is that one lookup table of erf(·) can be applied to a variety of applications with different Gaussian function parameters. Assuming k = k* makes the right-hand side of (17) closest to S_b, σ can then be estimated as n̂δ_x/k*. Similarly, we can make the substitution for the bottom equation in (16) and obtain another estimate of σ. In summary, the two estimators can be established as

σ̂_a = (N − n̂)δ_x / k*,  s.t.  k* : argmin_k ( S_a − (√(2π) Â (N − n̂)δ_x / (2k)) erf( k/√2 ) )²;
σ̂_b = n̂δ_x / k*,  s.t.  k* : argmin_k ( S_b − (√(2π) Â n̂δ_x / (2k)) erf( k/√2 ) )².  (18)

The two estimates would have different qualities, depending on how many samples are used for each. This further suggests that combining them is not as trivial as simply averaging them. Next, we develop a constructive way of combining them.
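In code, the estimators in (15)–(18) reduce to two lookups against a single precomputed table. The Python sketch below is our hedged illustration; the k-grid range and resolution are our assumptions, not values prescribed by the article:

```python
import numpy as np
from scipy.special import erf

def sigma_from_halves(y_samples, delta_x, A_hat, n_hat):
    """Sketch of (15)-(18): split the samples at the estimated peak index
    n_hat, then match each half-sum against one reusable erf table."""
    k_grid = np.linspace(0.1, 6.0, 1000)           # assumed search grid for k
    table = erf(k_grid / np.sqrt(2)) / k_grid      # erf(k/sqrt(2))/k, cf. (17)
    coef = np.sqrt(2 * np.pi) * A_hat / 2
    N = len(y_samples)
    S_b = np.sum(y_samples[:n_hat]) * delta_x      # (15), left of the peak
    S_a = np.sum(y_samples[n_hat:]) * delta_x      # (15), right of the peak
    k_b = k_grid[np.argmin((S_b - coef * n_hat * delta_x * table) ** 2)]
    k_a = k_grid[np.argmin((S_a - coef * (N - n_hat) * delta_x * table) ** 2)]
    return (N - n_hat) * delta_x / k_a, n_hat * delta_x / k_b   # (18)
```

Because the table depends only on k, it can be computed once offline and reused across applications, which is precisely the point of the substitution in (17).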


Asymptotically optimal σ estimation
The two estimates obtained in (18) are mainly differentiated by how many samples are used in their estimations. Therefore, identifying the impact of the employed samples on the estimation performance is helpful in determining a way to combine the estimates. One of the most common performance metrics for an estimator is the Cramér–Rao lower bound (CRLB) [7]. This points the direction of our next move.
Let us check the CRLB of σ̂_b first. It is estimated using the samples y[n] for n = 0, 1, …, n̂ − 1. Referring to (4), y[0], y[1], …, y[n̂ − 1] are jointly normally distributed with different means but the same variance, as given by σ_ξ². Recall that σ_ξ² is the power of the noise term ξ[n] (∀n) given in (4). Since we focus here on investigating the estimation performance of σ, we assume that A and μ are known. Then, the CRLB of the σ̂_b estimation, as obtained based on y[0], y[1], …, y[n̂ − 1], can be computed by

CRLB{σ̂_b} = σ_ξ² / Σ_{n=0}^{n̂−1} (∂f[n]/∂σ)²
           = σ_ξ² σ⁶ / Σ_{n=0}^{n̂−1} f[n]² (μ − δ_x n)⁴,  (19)

where the middle result is a simple application of [7, Eq. (3.14)], and the first partial derivative of f[n] can be readily derived based on its expression given in (4). With reference to (19), we can directly write the CRLB of σ̂_a, as given by

CRLB{σ̂_a} = σ_ξ² σ⁶ / Σ_{n=n̂}^{N−1} f[n]² (μ − δ_x n)⁴,  (20)

where the sole difference compared with (19) is the set of ns for the summation.
Jointly inspecting the two CRLBs, we see that they only differ by a linear coefficient; namely,

CRLB{σ̂_b} / CRLB{σ̂_a} = Σ_{n=n̂}^{N−1} f[n]² (μ − δ_x n)⁴ / Σ_{n=0}^{n̂−1} f[n]² (μ − δ_x n)⁴.  (21)

An insight from this result is that we only need a linear combination of the two σ estimates obtained in (18) to achieve an asymptotically optimal combined estimation. To further illustrate this, let us consider the following linear combination:

σ̂ = ρ σ̂_a + (1 − ρ) σ̂_b,  ρ ∈ (0, 1).  (22)

The mean square error (MSE) of the combined estimate can be computed as

E{(σ̂ − σ)²} = E{( ρσ̂_a + (1 − ρ)σ̂_b − ρσ − (1 − ρ)σ )²}
            = ρ² E{(σ̂_a − σ)²} + (1 − ρ)² E{(σ̂_b − σ)²}
            → ρ² CRLB{σ̂_a} + (1 − ρ)² CRLB{σ̂_b},  (23)

where a → b denotes that a asymptotically approaches b or a constant linear scaling of b.
Solving ∂E{(σ̂ − σ)²}/∂ρ = 0 leads to the following optimal ρ:

ρ* = CRLB{σ̂_b} / ( CRLB{σ̂_b} + CRLB{σ̂_a} )
   = Σ_{n=n̂}^{N−1} f[n]² (μ − δ_x n)⁴ / Σ_{n=0}^{N−1} f[n]² (μ − δ_x n)⁴,  (24)

where the CRLB expressions (19) and (20) have been applied. The optimality of ρ* can be validated by plugging ρ = ρ* into (23). Doing so yields

E{(σ̂ − σ)²} → σ_ξ² σ⁶ / Σ_{n=0}^{N−1} f[n]² (μ − δ_x n)⁴.

From the index ranges of the summations in (19), (20), and (24), we can see that the right-hand side of (24) becomes the CRLB of the σ estimation that is obtained based on all samples at hand, that is, the best estimation performance for any unbiased estimator of σ based on y[0], y[1], …, y[N − 1]. That is, taking ρ = ρ* in (22) leads to an asymptotically optimal unbiased σ estimation.
Note that f[n] used for calculating ρ* is unavailable. Thus, we replace f[n] with its noisy version y[n], attaining the following practically usable coefficient:

ρ* ≈ Σ_{n=n̂}^{N−1} y[n]² (μ − δ_x n)⁴ / Σ_{n=0}^{N−1} y[n]² (μ − δ_x n)⁴.  (25)

If we define A²/σ_ξ² as the estimation signal-to-noise ratio (SNR), where σ_ξ² is the power of the noise term ξ[n] in (4), then the equality in (25) can be approached as the estimation SNR increases.
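A direct Python transcription of the combination rule (22) with the data-driven weight (25) might look as follows (a sketch under the assumption that the peak index n̂ and an estimate of μ are already available):

```python
import numpy as np

def combine_sigma(sigma_a, sigma_b, y_samples, delta_x, mu_hat, n_hat):
    """Asymptotically optimal linear combination of the two sigma
    estimates, using y[n] in place of the unknown f[n] as in (25)."""
    n = np.arange(len(y_samples))
    w4 = (y_samples ** 2) * (mu_hat - delta_x * n) ** 4
    rho = np.sum(w4[n_hat:]) / np.sum(w4)         # (25)
    return rho * sigma_a + (1 - rho) * sigma_b    # (22)
```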


$$\frac{1}{N}\sum_{n=0}^{N-1}\left(A\,e^{-\frac{(n\delta_x-\hat{\mu})^2}{2\hat{\sigma}^2}}-y[n]\right)^2,$$

with respect to $A$. Setting the derivative of this MSE with respect to $A$ to zero yields the closed-form solution to the minimization, an improved $A$ estimate, as given by

$$\hat{A}=\frac{\sum_{n=0}^{N-1}e^{-\frac{(n\delta_x-\hat{\mu})^2}{2\hat{\sigma}^2}}\,y[n]}{\sum_{n=0}^{N-1}\left(e^{-\frac{(n\delta_x-\hat{\mu})^2}{2\hat{\sigma}^2}}\right)^2}.\quad(28)$$

We summarize the proposed Gaussian fitting method in Table 1 under the code name M3. As mentioned at the end of the "Iterative WLS Fitting" section, we position our method as an initial-stage fitting. For the second stage, we perform the iterative WLS with the initial weighting vector, i.e., $\mathbf{w}_0$ in (9), constructed using our initial fitting results. This two-stage fitting is named M4 in Table 1. We underline that, due to the high quality of the proposed initialization, M4 can converge much faster than the original iterative WLS, named M5 in Table 1. This will be validated shortly by simulation results. Also provided in the table is the $\sigma$ estimation method reviewed in the "An Interesting but Not-Good-Enough Initialization" section, named M1. Moreover, the combination of M1 and the iterative WLS is referred to as M2 in Table 1.

Simulation results
Simulation results are presented next to illustrate the performance of the five Gaussian fitting methods summarized in Table 1. The MATLAB simulation codes for generating Figures 2 and 3 can be downloaded from https://www.icloud.com/iclouddrive/005qLeEe1YOghnz4Om24ct76w#publish_v2. Unless otherwise specified, the parameters summarized in Table 2 are used in our simulations. For the original iterative WLS, we perform 12 iterations so that it can achieve a similar asymptotic performance to M4 in the high-SNR region. (This will be seen shortly in Figure 2.) In contrast, when the initialization from either M1 or (the proposed) M3 is employed, only two iterations are performed for the iterative WLS algorithm. Note that the running time for each method is provided in Table 1, where each time result is averaged over $10^5$ independent trials.

Figure 2 plots the MSEs of the five estimators listed in Table 1 against the estimation SNR, given by $A^2/\sigma_\xi^2$. Note that $\mu$ and $\sigma$ are randomly generated for the $10^5$ independent trials, conforming to the uniform distributions given in Table 2. As given in Table 2, $x\in[0,10]$ is set in the simulation. Thus, the settings of $\mu$ and $\sigma$ make the Gaussian function to be fitted in each trial incompletely sampled with a long tail; a noisy version of the function is plotted in Figure 1(a).

From Figure 2(b), we see that M3, which is based on the proposed local averaging in (26), achieves an obviously better $\mu$ estimation performance than M1, which is based on the naive method given in (12). From Figure 2(c), we see that M3 substantially outperforms M1. This validates the competency of the proposed $\sigma$ estimation scheme for the cases with the Gaussian function (to be fitted) incompletely sampled. From Figure 2(a), we see a significant improvement in M3, as compared with M1. This validates the advantage of using all samples for $A$ estimation, as developed in (28).

We remark that, as a price paid for performance improvement, the proposed initial fitting requires slightly more computational time than M1; see the last column of Table 1 for comparison. However, it is noteworthy that, thanks to the signal processing tricks introduced in the "Efficient Estimation of $\sigma$" section, the proposed $\sigma$ estimator, as established in (18), only involves simple floating-point arithmetic that can be readily handled by modern digital signal processors or field-programmable gate arrays.

From Figure 2, we further see that the proposed initial fitting results enable the iterative WLS to achieve much better performance for all three parameters than the other initializations. The improvement is particularly noticeable in low-SNR regions. Moreover, we underline that M4, based on the proposed initialization, only runs two iterations, while the original iterative WLS, i.e., M5, runs 12 iterations. On the one hand, this illustrates the critical importance of improving the initialization for the iterative WLS, a main motivation of this work. On the other hand, it validates our success in developing a high-quality initialization for the iterative WLS.
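To make the initial-stage fitting concrete, the following MATLAB sketch shows a plain local-averaging peak pick together with the amplitude refinement in (28). The exact forms of (26) and (27) are given in the earlier sections; the smoothing below (movmean with a window derived from $L$) and all variable names are our illustrative assumptions, not the authors' verbatim code.

    % Illustrative sketch: local-averaging peak estimate and the refinement (28).
    dx = 0.01;  x = (0:dx:10)';                 % sampling grid, cf. Table 2
    y  = exp(-(x - 8.4).^2/(2*1.2^2)) + 0.05*randn(size(x));  % example noisy samples

    L  = 3;                                     % window size, cf. Table 2
    ys = movmean(y, 2*L + 1);                   % suppress noise around the peak
    [~, nHat] = max(ys);                        % index of the smoothed maximum
    muHat = (nHat - 1)*dx;                      % peak-location estimate

    sigmaHat = 1.2;                             % stand-in; the article obtains it via (18)-(25)

    e    = exp(-(x - muHat).^2/(2*sigmaHat^2)); % unit-amplitude Gaussian template
    AHat = (e'*y)/(e'*e);                       % closed-form LS minimizer, i.e., (28)

The last line is exactly the ratio in (28): a one-dimensional least-squares fit of the amplitude once the template shape is fixed.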

Table 1. A summary of the simulated methods, where the running time of each method is averaged over $10^5$ independent trials.

Code Name | Method | Fitting Steps | Time (ns)
M1 | [6] | Estimate $\hat{A}$ and $\hat{n}$ as done in (12), where $\hat{n}$ leads to $\hat{\mu}=\hat{n}\delta_x$; estimate $\hat{\sigma}$ using (11). | 76.79
M2 | [6] and [5] | Stage 1: Run M1 first, getting initial $\hat{A}$, $\hat{\mu}$, and $\hat{\sigma}$. Stage 2: Perform the iterative WLS based on (9), where $\mathbf{w}_0=\big[\hat{A}e^{-\frac{(n\delta_x-\hat{n}\delta_x)^2}{2\hat{\sigma}^2}}\big]^{\mathrm{T}}$, $n=0,1,\ldots,N-1$. | 275
M3 | New | Estimate $\hat{\mu}$ based on (26); estimate $\hat{A}$ using (27); perform the estimators in (18), attaining $\hat{\sigma}_a$ and $\hat{\sigma}_b$; combine the two estimates in the linear manner depicted in (22), where the optimal combination coefficient, as approximately calculated in (25), is used; and refine $\hat{A}$ as done in (28). | 304.61
M4 | New and [5] | Stage 1: Run M3 first, getting initial $\hat{A}$, $\hat{\mu}$, and $\hat{\sigma}$. Stage 2: This is the same as in M2. | 502.81
M5 | [5] | Perform the iterative WLS based on (9). | 1,009.72

The simulations are run in MATLAB R2021a installed on a computing platform equipped with an Intel Xeon Gold 6238R CPU (2.2 GHz, 38.5-MB L3 cache; maximum turbo frequency: 4 GHz; minimum: 3 GHz).
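For reference, the second stage can be sketched as follows: build the initial weighting vector $\mathbf{w}_0$ from the stage 1 estimates and run a few iterations of the log-domain weighted LS of [4] and [5]. The loop is our paraphrase of that algorithm under the Gaussian model of (1), not the authors' exact implementation; it continues from the variables of the previous snippet.

    % Sketch of the second stage (M4-style), continuing the previous snippet.
    Phi = [ones(numel(x),1), x, x.^2];          % quadratic model for ln y
    yl  = log(max(y, eps));                     % guard against nonpositive samples
    w   = AHat*exp(-(x - muHat).^2/(2*sigmaHat^2));   % w_0, cf. the M2 row of Table 1

    for it = 1:2                                % two iterations suffice with a good w_0
        theta = (w.*Phi) \ (w.*yl);             % WLS with weights w.^2: theta = [a; b; c]
        a = theta(1); b = theta(2); c = theta(3);  % expect c < 0 for a Gaussian
        sigmaHat = sqrt(-1/(2*c));              % invert ln y = a + b*x + c*x^2
        muHat    = -b/(2*c);
        AHat     = exp(a - b^2/(4*c));
        w = AHat*exp(-(x - muHat).^2/(2*sigmaHat^2));  % refresh the weights
    end

Because the residuals of the log-domain fit are scaled by the current model estimate, well-initialized weights keep the tail samples from dominating, which is why a good stage 1 matters so much here.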

IEEE SIGNAL PROCESSING MAGAZINE | November 2022 | 81


[Figure 2: three panels of MSE (log scale, roughly $10^{-4}$ to $10^0$) versus $A^2/\sigma_\xi^2$ in dB, comparing M1-M5.]

FIGURE 2. The MSEs of fitting results versus the SNR in the sampled Gaussian function with $A=1$, where $\mu$ and $\sigma$ are randomly generated based on the uniform distributions given in Table 2, for (a) $\hat{A}$, (b) $\hat{\mu}$, and (c) $\hat{\sigma}$. Note that $\sigma_\xi^2$ denotes the power of the noise term $\xi[n]$ given in (4). The MSE is calculated over $10^5$ trials, each with independently generated and normally distributed noise.
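For readers who want to reproduce such curves, one randomized trial conforming to Table 2 can be drawn as in the following sketch; the variable names are ours.

    % One random trial per Table 2 (illustrative script).
    dx = 0.01;  x = (0:dx:10)';                 % function variable and sampling grid
    A  = 1;
    mu    = 8 + rand;                           % mu ~ U[8, 9]
    sigma = 1 + 0.3*rand;                       % sigma ~ U[1, 1.3]
    g  = A*exp(-(x - mu).^2/(2*sigma^2));       % incompletely sampled, long tail

    SNRdB   = 12;                               % target A^2/sigma_xi^2 in dB
    sigmaXi = sqrt(A^2/10^(SNRdB/10));          % noise standard deviation
    y = g + sigmaXi*randn(size(x));             % noisy observation, cf. (4)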

[Figure 3: three panels of MSE (log scale) versus the number of iterations (2-12), comparing M1-M5.]

FIGURE 3. The MSEs of fitting results versus the number of iterations used in M5 and the second stage of M2/M4 for (a) $\hat{A}$, (b) $\hat{\mu}$, and (c) $\hat{\sigma}$, where $A=1$, $A^2/\sigma_\xi^2$ is 12 dB, and $\mu$ and $\sigma$ are randomly generated based on the uniform distributions given in Table 2. The MSE is calculated over $10^5$ trials, each with independently generated and normally distributed noise.
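A minimal Monte Carlo skeleton for curves of this kind is sketched below; for brevity it uses the naive peak pick as a stand-in estimator and only $10^3$ trials, so substitute any of M1-M5 and $10^5$ trials to mirror the article's setup.

    % Monte Carlo MSE skeleton (illustrative; swap in M1-M5 for the stand-in fit).
    dx = 0.01;  x = (0:dx:10)';  A = 1;  SNRdB = 12;
    sigmaXi = sqrt(A^2/10^(SNRdB/10));
    T = 1e3;  err = zeros(T, 2);                % squared errors of (AHat, muHat)
    for t = 1:T
        mu = 8 + rand;  sigma = 1 + 0.3*rand;   % Table 2 distributions
        y  = A*exp(-(x - mu).^2/(2*sigma^2)) + sigmaXi*randn(size(x));
        [AHat, n] = max(y);  muHat = (n - 1)*dx;   % naive stand-in estimator
        err(t,:) = ([AHat, muHat] - [A, mu]).^2;
    end
    mse = mean(err);                            % one MSE per parameter, as plotted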



Table 2. The simulation parameters.

Variable | Description | Value
$A$ | A parameter of the Gaussian function determining its maximum amplitude [see (1)] | 1
$\mu$ | A parameter of the Gaussian function indicating its peak location [see (1)] | $U[8,9]$
$\sigma$ | A parameter of the Gaussian function indicating the width of its principal region [see Figure 1(c)] | $U[1,1.3]$
$x$ | Function variable | $[0,10]$
$\delta_x$ | Sampling interval of $x$ [see (4)] | 0.01
$k$ | Intermediate variable used in the proposed estimators given in (18) | 0.1 : 0.01 : 10
$L$ | Window size for estimating $\mu$ as done in (26) | 3
– | Number of iterations for M5 | 12
– | Number of iterations in stage 2 of M2/M4 | 2
$\sigma_\xi^2$ | Power of the additive noise $\xi[n]$ given in (4) | −10 : 0.5 : 20 dB
$U[a,b]$ denotes the uniform distribution in $[a,b]$.

In spite of the random changes of $\mu$ and $\sigma$ over the $10^5$ independent trials, Figure 2 shows that our proposed initial fitting enables the WLS to achieve consistently better and more stable performance than prior arts. This suggests that we have successfully relieved the dependence of the iterative WLS on the proportion of the tail region of a Gaussian function. In contrast, as illustrated in the "Iterative WLS Fitting" and "An Interesting but Not-Good-Enough Initialization" sections, the performance of M5 and M1 can be subject to how completely a Gaussian function is sampled. M2, which is based on M1, also has this dependence.

Figure 3 shows the MSEs of the five estimators listed in Table 1 against the number of iterations of the iterative WLS, performed in the second stage of M2/M4 and in M5. For all three parameters, we can see that the proposed initialization (M3) nontrivially outperforms the state-of-the-art M1. We also see that M4 approximately converges after two iterations, while M2 and M5 present much slower convergence, with MSE performance inferior to M4 even after 12 iterations. These observations highlight the critical importance of a good initialization to the iterative WLS, particularly when the sampled Gaussian function is noisy and incomplete with a long tail. They again validate the effectiveness of the proposed techniques, as enabled by the unveiled signal processing tricks, in these challenging scenarios.

Conclusions
In this article, we develop a high-quality initialization method for the iterative WLS-based linear Gaussian fitting algorithm. This is achieved by a few signal processing tricks, as summarized here:
■ We introduce a simple local averaging technique that reduces the noise impact on estimating the peak location of a Gaussian function, i.e., $\mu$.
■ We provide a more precise integral result that is approximated by the summation of the Gaussian function samples, which results in two estimates of $\sigma$.
■ We unveil the linear relation between the asymptotic performance of the two estimates and then design an asymptotically optimal combination of these estimates.
■ We also improve the estimation of the peak amplitude of a Gaussian function by minimizing the MSE of the initial Gaussian fitting.
Corroborated by simulation results, the proposed initialization can substantially improve the accuracy of the iterative WLS-based linear Gaussian fitting, even in challenging scenarios with strong noise and an incompletely sampled Gaussian function with a long tail. Notably, the performance improvement is also accompanied by improved fitting efficiency.

Acknowledgment
We thank Prof. Rodrigo Capobianco Guido, IEEE Signal Processing Magazine's area editor for columns and forum, and Dr. Wei Hu, associate editor for columns and forum, for managing the review of our article. We also thank the editors and the anonymous reviewers for providing constructive suggestions to improve our work. We further acknowledge the support of the Australian Research Council under the Discovery Project grant DP210101411.

Authors
Kai Wu ([email protected]) received his Ph.D. degree from Xidian University, China, in 2019 and his Ph.D. degree from the University of Technology Sydney (UTS), Australia, in 2020. He is a research fellow at the Global Big Data Technologies Centre, UTS, Sydney, NSW 2007, Australia. His Xidian Ph.D. won the Excellent Ph.D. Thesis Award 2019 from the Chinese Institute of Electronic Engineering. His UTS Ph.D. was awarded The Chancellor's List 2020. His research interests include array signal processing and its applications in radar and communications. He is a Member of IEEE.

J. Andrew Zhang ([email protected]) received his Ph.D. degree from the Australian National University in 2004. He is an associate professor in the School of Electrical and Data Engineering, University of Technology Sydney, Sydney, NSW 2007, Australia. He was a researcher with Data61, Commonwealth Scientific and Industrial Research Organization (CSIRO), Australia, from 2010 to 2016; Networked Systems, National ICT Australia, from 2004 to 2010; and ZTE Corp., Nanjing, China, from 1999 to 2001. He has published more than 200 papers in leading international journals and conference proceedings and has won five best paper awards. He received the CSIRO Chairman's Medal and the Australian Engineering Innovation Award in 2012 for exceptional research achievements in multigigabit wireless communications. His research interests include signal processing for wireless communications and sensing. He is a Senior Member of IEEE.

Y. Jay Guo ([email protected]) received his Ph.D. degree from Xi'an



Jiaotong University, Xi'an, China, in 1987. He is a distinguished professor and the director of the Global Big Data Technologies Centre at the University of Technology Sydney, Sydney, NSW 2007, Australia, and the technical director of the New South Wales Connectivity Innovation Network, Australia. He has won a number of prestigious awards, including Australian Engineering Excellence Awards (2007 and 2012) and the CSIRO Chairman's Medal (2007 and 2012). He was named one of the most influential engineers in Australia in 2014 and 2015 and one of the top researchers in Australia in 2020 and 2021. His research interests include antennas, millimeter-wave and terahertz communications and sensing systems, and big data technologies. He is a Fellow of IEEE.

References
[1] A. D. Poularikas, Handbook of Formulas and Tables for Signal Processing. Boca Raton, FL, USA: CRC Press, 2018.
[2] "What is a Gaussian function?" LogicPlum. https://logicplum.com/knowledge-base/gaussian-function/ (Accessed: Mar. 11, 2022).
[3] E. Kheirati Roonizi, "A new algorithm for fitting a Gaussian function riding on the polynomial background," IEEE Signal Process. Lett., vol. 20, no. 11, pp. 1062–1065, 2013, doi: 10.1109/LSP.2013.2280577.
[4] R. A. Caruana, R. B. Searle, T. Heller, and S. I. Shupack, "Fast algorithm for the resolution of spectra," Anal. Chem., vol. 58, no. 6, pp. 1162–1167, 1986, doi: 10.1021/ac00297a041.
[5] H. Guo, "A simple algorithm for fitting a Gaussian function [DSP Tips and Tricks]," IEEE Signal Process. Mag., vol. 28, no. 5, pp. 134–137, 2011, doi: 10.1109/MSP.2011.941846.
[6] I. Al-Nahhal, O. A. Dobre, E. Basar, C. Moloney, and S. Ikki, "A fast, accurate, and separable method for fitting a Gaussian function [Tips & Tricks]," IEEE Signal Process. Mag., vol. 36, no. 6, pp. 157–163, 2019, doi: 10.1109/MSP.2019.2927685.
[7] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[8] "Help center." MathWorks. https://au.mathworks.com/help/matlab/ (Accessed: Mar. 11, 2022).
SP

FROM THE EDITOR  (continued from page 3)

at the Epicenter of Ground-Shaking Research," reports research results of three projects focused on various issues related to earthquakes, from forecasting to localization of victims.

Among these articles, many are using, at least partly, machine learning methods. Although some are considering sparsity priors, I believe that the usefulness, the pros and cons, and the cost of these methods (which are very greedy in power, computation, and memory) could be discussed more thoroughly. Benchmarks against more classical methods should use metrics that are able to take into account at least these parameters and not only a simple performance index.

If you have any publication ideas for SPM, I encourage you to contact the area editors or me (see the editorial board on the SPM web pages) to discuss your idea. Remember that articles in SPM are not suited for publishing either new results or surveys. They are tutorial-like articles that must be comprehensive for a wide audience and accompanied by a relevant selection of both figures and references, as Robert Heath, the previous SPM editor-in-chief, explained very clearly in [4]. Concerning the "Lecture Notes" and "Tips & Tricks" columns, I also think that sharing data and code would be an actual added value to these articles, and I encourage authors to use Code Ocean facilities (https://innovate.ieee.org/ieee-code-ocean/) for this purpose.

References
[1] R. Couillet, D. Trystram, and T. Menissier, "The submerged part of the AI-Ceberg [Perspectives]," IEEE Signal Process. Mag., vol. 39, no. 5, pp. 10–17, Sep. 2022, doi: 10.1109/MSP.2022.3182938.
[2] "IEEE Panel of Editors 2022." IEEE.tv. Accessed: Apr. 29, 2022. [Online]. Available: https://ieeetv.ieee.org/channels/communities/welcome-from-panel-of-editors-chair-poe-2022-0
[3] "IEEE Publication Services and Products Board Operations Manual 2021," IEEE Publications, Piscataway, NJ, USA, 2022. [Online]. Available: https://pspb.ieee.org/images/files/files/opsmanual.pdf
[4] R. W. Heath, "Reflections on tutorials and surveys [From the Editor]," IEEE Signal Process. Mag., vol. 37, no. 5, pp. 3–4, Sep. 2020, doi: 10.1109/MSP.2020.3006648.
SP


SP FORUM  (continued from page 75)

across a number of academic publishing companies. He joined IEEE initially as an editor for IEEE Press before transitioning to the Intellectual Property Rights (IPR) Office as an IPR specialist.

Christian Jutten ([email protected]) received his master's and Ph.D. degrees in electrical engineering at Institut National Polytechnique de Grenoble, France. He has been a professor since 1989 and an emeritus professor since 2019 at Université Grenoble Alpes, Grenoble 38402, France. Since 1979, his research has focused on statistical signal processing and machine learning with applications in biomedical engineering, speech processing, hyperspectral imaging, and chemical engineering. Since 2019, he has been a scientific advisor for scientific integrity for the French National Center of Scientific Research and, since January 2021, editor-in-chief of IEEE Signal Processing Magazine. He is a Fellow of IEEE. SP



APPLICATIONS CORNER
Lorenzo Picinali , Brian FG Katz , Michele Geronazzo , Piotr Majdak,
Arcadio Reyes-Lecuona, and Alessandro Vinciarelli

The SONICOM Project: Artificial Intelligence-Driven Immersive Audio, From Personalization to Modeling

Every individual perceives spatial audio differently, due in large part to the unique and complex shape of the ears and head. Therefore, high-quality, headphone-based spatial audio should be uniquely tailored to each listener in an effective and efficient manner. Artificial intelligence (AI) is a powerful tool that can be used to drive forward research in spatial audio personalization. The SONICOM project aims to employ a data-driven approach that links physiological characteristics of the ear to the individual acoustic filters, which allow us to localize sound sources and perceive them as being located around us. A small amount of data acquired from users could allow personalized audio experiences, and AI could facilitate this by offering a new perspective on the matter. A Bayesian approach to computational neuroscience and binaural sound reproduction will be linked to create a metric for AI-based algorithms that will predict realistic spatial audio quality. Being able to consistently and repeatedly evaluate and quantify the improvements brought by technological advancements, as well as the impact these have on complex interactions in virtual environments, will be key for the development of new techniques and for unlocking new approaches to understanding the mechanisms of human spatial hearing and communication.

Introduction
Immersive audio is what we experience in our everyday life, when we can hear and interact with sounds coming from different positions around us. We can simulate this interactive auditory experience within virtual reality (VR) and augmented reality (AR) using off-the-shelf components such as headphones, digital signal processors, inertial sensors, and handheld controllers. Immersive audio technologies have the potential to revolutionize the way we interact socially within AR/VR environments and applications. But several major challenges still need to be tackled before we can achieve sufficiently high-quality simulations and control. This will involve not only significant technological advancements but also measuring, understanding, and modeling low-level psychophysical (sensory) as well as high-level psychological (social interaction) perception.

Funded by the Horizon 2020 FET-Proact scheme, the SONICOM project (www.sonicom.eu) started in January 2021 and, over the course of the next five years, will aim to transform auditory social interaction and communication in AR/VR by achieving the following objectives:
■ It will design a new generation of immersive audio technologies and techniques, specifically looking at customization and personalization of the audio rendering.
■ It will explore, map, and model how the physical characteristics of spatialized auditory stimuli can influence observable behavioral, physiological, kinematic, and psychophysical reactions of listeners within social interaction scenarios.
■ It will evaluate the developed techniques and data-driven outputs in an ecologically valid manner, exploiting AR/VR simulations as well as real-life scenarios.
■ It will create an ecosystem for auditory data closely linked with model implementations and immersive audio-rendering components, reinforcing the idea of reproducible research and promoting future development and innovation in the area of auditory-based social interaction.

Overview
SONICOM involves an international team of 10 research institutions and creative tech companies from six European countries, all active in areas such as immersive acoustics, AI, spatial hearing, auditory modeling, computational social intelligence, and interactive computing. The workplan is centered around three pivotal research work packages titled "Immersion," "Interaction," and "Beyond," each introduced in one of the following sections. The first looks at
immersive audio challenges dealing with the technical and sensory perspectives. On the other hand, the second focuses on the interaction between these and higher-level sociopsychological implications. Finally, the integration of core research, proof-of-concept evaluations, and the creation of the auditory data ecosystem ensures that various outputs of the project will have an impact beyond the end of SONICOM (see the "Beyond" section).

Immersion
Before reaching the listener's eardrums, the acoustic field is filtered due to shadowing and diffraction effects by the listener's body, in particular, the head, torso, and outer ears. This natural filtering depends on the spatial relationship between the source and the listener and can be described by head-related transfer functions (HRTFs), which can be acoustically measured (e.g., see Figure 1) or numerically modeled (e.g., in [1]). Everyone perceives sound differently due to the particular shape of their ears and head. For this reason, high-quality simulations should be uniquely tailored to each individual, effectively and efficiently. Within SONICOM, we propose a data-driven approach linking the physiological characteristics of the ear to the individual acoustic filters employed for perceiving sound sources in space.

Our HRTF modeling research considers a variety of approaches. On the one hand, we focus on the creation of parametric pinna models (PPMs) [2], [3] and their application to create an AI-based framework for the numerical calculation of HRTFs. On the other hand, we focus on HRTF database matching, an approach based on the hypothesis that individuals can be paired with existing high-quality HRTF data sets (measured or modeled) as long as they share some relevant, predefined characteristics in the perceptual feature space. To this end, we will expand the procedures based on objective similarity measures [4] and subjective listener input [5], [6]. The said measures concern geometrical variations for PPMs, perceptual deviations of the computed HRTFs, and signal-domain similarities for HRTF matching, all referenced to a project database comprising geometrical scans and associated HRTF measurements from a set of individuals.
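As a simple illustration of what an HRTF-based renderer ultimately does, the MATLAB sketch below convolves a source signal with a left/right head-related impulse response pair. In practice the pair would come from measured or modeled HRTFs (e.g., a SOFA file); here a crude delay-and-attenuate pair stands in so that the script runs, and all names are our illustrative assumptions rather than part of any SONICOM tool.

    % Minimal static binaural rendering sketch (illustrative only).
    fs  = 48e3;  s = 0.1*randn(fs, 1);          % 1 s of noise as a stand-in source
    itd = round(0.0005*fs);                     % ~0.5-ms interaural time difference
    hL  = [1; zeros(itd, 1)];                   % left ear: direct path, full level
    hR  = [zeros(itd, 1); 0.5];                 % right ear: delayed and attenuated
    out = [conv(s, hL), conv(s, hR)];           % filter with the two impulse responses
    soundsc(out, fs);                           % the source is perceived off to the left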
Being able to consistently and repeat-
ably evaluate and quantify the improve-
ments brought by these technological
advancements will be absolutely key,
not only for the development of new
techniques but also for unlocking new
approaches to understanding the mecha-
nisms of human spatial hearing. Our
approach to personalization presents a
new perspective on this problem, linking
Bayesian theories of active inference [7],
[8] and binaural (i.e., related to both ears)
sound reproduction to create data sets of
human behavior and perceptually valid
metrics modeling such behavior. By hav-
ing acoustic simulations validated against
acoustic measurements, and human
auditory models validated against actual
behavior, we will provide important tools
for the development of AI-based predic-
tors of realistic spatial audio quality.
We will also concentrate on the
issue of blending virtual objects in
real scenes, which is one of the corner-
stones of AR. To blend the real with
the virtual worlds in an AR scenario,
it is essential to develop techniques for
automatic estimation of the reverberant
characteristics of the real environment.
This will be achieved by character-
izing the acoustical environment sur-
rounding the AR user. The extracted
data can then be employed to generate
realistic virtual reverberation matching
the real world. After a set of pilot stud-
FIGURE 1. The HRTF measurement setup at Imperial College London, U.K. [24]. ies looking at perceptual needs in terms



of reverberation processing (e.g., in [9]), we will employ geometrical acoustics and simplified computational models, such as scattering delay networks [10], [11], to generate real-time simulations of the real-world environment where the listener is located.

Finally, to provide ecologically valid evaluations, studied settings will not be limited to oversimplified, highly controlled, traditional "laboratory" conditions but will seek to extend the set of evaluation scenarios to better represent a variety of real-world use cases for AR/VR technology. Combining the desire for robust evaluation techniques with realistic use cases requires a delicate balance of experimental design to present a real-world-like context while still maintaining the required laboratory controllability to obtain meaningful, exploitable data (e.g., in [12]).

Interaction
AR and VR technologies work by stimulating the senses of their users and, as a result, most of the previous research has focused on reproducing the sensory experience of the physical world. However, this is not sufficient when a virtual or augmented environment involves interaction with other agents, whether human or artificial.

For example, the literature shows that a behavioral cue (smile, gesture, sentence, and so on) stimulates the same range of unconscious reactions whether it is displayed by an artificial agent or by a person [13]. Such a phenomenon, known as a media equation [14], can be observed in AR/VR where, in the particular case of speech, it is possible to simulate different distances between artificial speakers and listeners. This is important because social and physical distances are deeply intertwined from a cognitive and psychological point of view [15]. Therefore, it is possible to expect that VR users will tend to attribute different social characteristics (intentions, personality traits, attitudes, and so forth) to speakers rendered at different distances in the physical space. Besides being interesting from a scientific point of view, such a phenomenon can contribute to the design of personalized interaction experiences. For example, it could enable a virtual pet to develop deeper intimacy with children by sounding physically closer or help a virtual character to appear less friendly by sounding more distant. More generally, it will be possible to modulate the perceived distance between VR users and virtual agents according to the roles the latter play within immersive and interactive environments. Although it has attracted significant attention in recent years, the use of perceived physical distance to interface VR technology with the psychology and cognition of its users is still at a pioneering stage (see, e.g., [16] and [17]). Its investigation is a part of what is planned within the SONICOM project and promises to be fruitful from both scientific and technological points of view.

Beyond
Ensuring that the project's accomplishments (algorithms, AI-based models, evaluation data, and so on) remain available to various stakeholders, including the wider research communities beyond SONICOM, is of primary importance. To facilitate this and consolidate all the developed tools within a common structure, the SONICOM Ecosystem will be created, which will include open source software modules implementing the various tools, algorithms, and models developed within the project.

The main part of the SONICOM Ecosystem will be the SONICOM Framework, consisting of the Binaural Rendering Toolbox (BRT), auditory data and models, and dedicated hardware. The rendering core of the BRT is inspired by the work that has already been done on the development of the 3D Tune-In Toolkit [18]. It will be implemented as an interchangeable module, allowing the use of numerous rendering engines for various software platforms, connected with the rendering core modules using interfaces and communication protocols. Once benchmarked and evaluated, the SONICOM Framework will become a part of the SONICOM Ecosystem, which will further include the Auditory Model Toolbox [19], [20] and toolboxes dealing with HRTFs stored in the spatially oriented format for acoustics (SOFA) [21]. SOFA, a standard of the Audio Engineering Society (AES69-2015 [22]), has received a recent upgrade and will be further extended toward the needs of SONICOM and its Ecosystem. A further component of the Ecosystem will be Mesh2HRTF, an application to numerically calculate HRTFs. Originally developed in 2015 [1], it will receive a major upgrade when integrated into the Ecosystem.

Considering the timeline of the project, the core research activities of the "Immersion" and "Interaction" work packages will progress until the first half of 2024, when the work on the SONICOM Framework and Ecosystem will commence. These efforts will be preparatory to the launch of the listener acoustic personalization (LAP) challenge, which will open up to researchers across the world to contribute and compete, within various scenarios and tasks, with their state-of-the-art HRTF personalization algorithms. The recently introduced paradigm of the egocentric audio perspective [23] will guide the definition of effective evaluation measures considering the first-person point of view of embodied, environmentally situated perceivers, whose sensorimotor processes are tightly connected with exploratory actions. A publicly released corpus will be an integral part of the Ecosystem and will include AI-driven data, behavioral and HRTF data, and human body scans for a set of listeners. Moreover, a range of real-life scenarios of increasing complexity will be captured by microphone arrays and multimodal sensors to form the ground truth of objects and actions in the scenes. All of this is meant to simulate a digital replica of the complex listener-reality system, which will allow us to create a virtual/augmented listening experience. Such an integration of SONICOM's outputs aims to promote reproducible research, creating a sustainable basis for further research beyond SONICOM.

Conclusions



Although it is true that a large amount of research has been carried out in recent years addressing challenges that are similar to what we are tackling in SONICOM, there are transformative elements within the research we are planning, which could be the key to creating a new generation of immersive audio technologies and techniques. One such element is the use of a data-driven and AI-based approach to HRTF personalization, looking not only at the physical nature of the problem (e.g., ear morphology) but also at the perceptual side of things (e.g., listener preferences and performances). The extensive use of perceptual models also presents a strong element of novelty, using existing ones as a guide during the prototyping and design stages as well as for helping to better understand the experimental research outputs. This will contribute to creating new and more accurate models to be shared with the wider research communities. Within this context, the attempt to make use of collected data to model social-level processing within existing sensory models is a novel element and will enable better prediction of responses to complex tasks such as speech-in-noise understanding and, more generally, sonic interactions within AR/VR scenarios.

Finally, it seems clear that to ensure an adequate level of standardization and to consistently advance the achievements of research in this area, a concerted and coordinated effort across disciplines, research institutions, and industry players is absolutely essential, and this is precisely what we are trying to do within SONICOM.

Acknowledgment
The SONICOM project has received funding from the European Union's Horizon 2020 Research and Innovation Program under grant agreement number 101017743.

Authors
Lorenzo Picinali ([email protected]) is with Audio Experience Design, Imperial College London, London, SW7 2AZ, U.K. His research interests include spatial acoustics and immersive audio, perceptual hearing training, and ecoacoustic monitoring.

Brian FG Katz ([email protected]) is with the Institut d'Alembert, Sorbonne Université, Paris, 75252, France. His research interests include acoustics, HCI, virtual reality, spatial perception, and room acoustics.

Michele Geronazzo ([email protected]) is with Imperial College London, London, SW7 2AZ, U.K., and the University of Padova, Padova, 35122, Italy. His research interests include binaural spatial audio modeling and synthesis, and sound in multimodal virtual/augmented reality.

Piotr Majdak ([email protected]) is with the Acoustics Research Institute, Austrian Academy of Sciences, Wien, 31390, Austria. His research interests include the perceptual effects of HRTFs, their acoustic measurement, and their numeric calculation.

Arcadio Reyes-Lecuona ([email protected]) is with the University of Malaga, Malaga, 29071, Spain. His research interests include 3D audio and HCI in VR, including sonic interaction.

Alessandro Vinciarelli ([email protected]) is with the School of Computing Science, University of Glasgow, Glasgow, G12 8QQ, U.K. His research interests include social signal processing.

References
[1] H. Ziegelwanger, W. Kreuzer, and P. Majdak, "Mesh2HRTF: An open-source software package for the numerical calculation of head-related transfer functions," in Proc. 22nd Int. Congr. Sound Vib., 2015, pp. 1–8, doi: 10.13140/RG.2.1.1707.1128.
[2] K. Pollack, P. Majdak, and H. Furtado, "Evaluation of pinna point cloud alignment by means of non-rigid registration algorithms," Audio Engineering Society, New York, NY, USA, May 2021. [Online]. Available: https://www.aes.org/e-lib/browse.cfm?elib=21068
[3] P. Stitt and B. F. G. Katz, "Sensitivity analysis of pinna morphology on head-related transfer functions simulated via a parametric pinna model," J. Acoust. Soc. Amer., vol. 149, no. 4, pp. 2559–2572, 2021, doi: 10.1121/10.0004128.
[4] A. Andreopoulou and A. Roginska, "Database matching of sparsely measured head-related transfer functions," J. Audio Eng. Soc., vol. 65, no. 7/8, pp. 552–561, Jul. 2017, doi: 10.17743/jaes.2017.0021.
[5] B. F. G. Katz and G. Parseihian, "Perceptually based head-related transfer function database optimization," J. Acoust. Soc. Amer., vol. 131, no. 2, pp. EL99–EL105, 2012, doi: 10.1121/1.3672641.
[6] C. Kim, V. Lim, and L. Picinali, "Investigation into consistency of subjective and objective perceptual selection of non-individual head-related transfer functions," J. Audio Eng. Soc., vol. 68, no. 11, pp. 819–831, 2020, doi: 10.17743/jaes.2020.0053.
[7] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, "Active inference: A process theory," Neural Comput., vol. 29, no. 1, pp. 1–49, Jan. 2017, doi: 10.1162/NECO_a_00912.
[8] G. McLachlan, P. Majdak, J. Reijniers, and H. Peremans, "Towards modelling active sound localisation based on Bayesian inference in a static environment," Acta Acust., vol. 5, p. 45, Oct. 2021, doi: 10.1051/aacus/2021039.
[9] I. Engel, C. Henry, S. V. Amengual Garí, P. W. Robinson, and L. Picinali, "Perceptual implications of different Ambisonics-based methods for binaural reverberation," J. Acoust. Soc. Amer., vol. 149, no. 2, pp. 895–910, 2021, doi: 10.1121/10.0003437.
[10] E. De Sena, H. Haciihabiboğlu, Z. Cvetković, and J. O. Smith, "Efficient synthesis of room acoustics via scattering delay networks," IEEE Trans. Audio, Speech, Language Process., vol. 23, no. 9, pp. 1478–1492, 2015, doi: 10.1109/TASLP.2015.2438547.
[11] M. Geronazzo, J. Y. Tissieres, and S. Serafin, "A minimal personalization of dynamic binaural synthesis with mixed structural modeling and scattering delay networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2020, pp. 411–415.
[12] D. Poirier-Quinot and B. F. G. Katz, "Assessing the impact of head-related transfer function individualization on task performance: Case of a virtual reality shooter game," J. Audio Eng. Soc., vol. 68, no. 4, pp. 248–260, 2020, doi: 10.17743/jaes.2020.0004.
[13] A. Vinciarelli et al., "Bridging the gap between social animal and unsocial machine: A survey of social signal processing," IEEE Trans. Affective Comput., vol. 3, no. 1, pp. 69–87, 2011, doi: 10.1109/T-AFFC.2011.27.
[14] B. Reeves and C. Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People. Cambridge, MA, USA: Cambridge Univ. Press, 1996.
[15] E. Hall, The Silent Language. New York, NY, USA: Anchor Books, 1959.
[16] I. Kastanis and M. Slater, "Reinforcement learning utilizes proxemics: An avatar learns to manipulate the position of people in immersive virtual reality," ACM Trans. Appl. Perception, vol. 9, no. 1, pp. 1–15, 2012, doi: 10.1145/2134203.2134206.
[17] J. Williamson, J. Li, V. Vinayagamoorthy, D. Shamma, and P. Cesar, "Proxemics and social interactions in an instrumented virtual reality workshop," in Proc. CHI Conf. Hum. Factor Comput. Syst., 2021, pp. 1–13, doi: 10.1145/3411764.3445729.
[18] M. Cuevas-Rodríguez et al., "3D Tune-In Toolkit: An open-source library for real-time binaural spatialisation," PLoS One, vol. 14, no. 3, p. e0211899, Mar. 2019, doi: 10.1371/journal.pone.0211899.
[19] P. Majdak, C. Hollomey, and R. Baumgartner, "AMT 1.x: A toolbox for reproducible research in auditory modeling," Acta Acust., vol. 6, p. 19, May 2022, doi: 10.1051/aacus/2022011.
[20] P. Søndergaard and P. Majdak, "The auditory modeling toolbox," in The Technology of Binaural Listening, J. Blauert, Ed. Berlin, Germany: Springer-Verlag, 2013, pp. 33–56.
[21] P. Majdak, F. Zotter, F. Brinkmann, J. De Muynke, M. Mihocic, and M. Noisternig, "Spatially oriented format for acoustics 2.1: Introduction and recent advances," J. Audio Eng. Soc., to be published.
[22] T. Ammermann et al., AES Standard for File Exchange - Spatial Acoustic Data File Format, Standard AES69-2015, Audio Engineering Society, New York, NY, USA, Mar. 2015.
[23] M. Geronazzo and S. Serafin, Eds., Sonic Interactions in Virtual Environments, Human-Computer Interaction Series, 1st ed. Cham, Switzerland: Springer International Publishing, 2022.
[24] Audio Experience Design. Accessed: Dec. 20, 2021. [Online]. Available: axdesign.co.uk
SP


DATES AHEAD
Please send calendar submissions to:
Dates Ahead, Att: Samantha Walter, Email: [email protected]

Editor’s Note
Due to changing situations around
the world because of the corona-
virus (COVID-19) pandemic,
please double-check each confer-
ence’s website for the latest news
and updates.

The International Symposium on Biomedical Imaging will be held in Cartagena de Indias, Colombia, 18–21 April 2023.

2022

NOVEMBER
IEEE Workshop on Signal Processing Systems (SiPS)
2–4 November, Rennes, France.
General Chairs: John McAllister and Maxime Pelcat
URL: https://sips2022.insa-rennes.fr/

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
7–10 November, Chiang Mai, Thailand.
General Chairs: Nipon Theera-Umpon, Kosin Chamnongthai, Toshihisa Tanaka, Anthony Kuh, and Kenneth Kin-Man Lam
URL: https://www.apsipa2022.org/

18th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS)
29 November–2 December, Madrid, Spain.
General Chair: Javier Ortega-Garcia
URL: http://atvs.ii.uam.es/avss2022/index.html

DECEMBER
Picture Coding Symposium (PCS)
7–9 December, San Jose, CA, USA.
General Chairs: Shan Liu and Antonio Ortega
URL: https://2022.picturecodingsymposium.org/

IEEE International Workshop on Information Forensics & Security
12–16 December, Shanghai, China.
General Chairs: Rémi Cogranne and Xinpeng Zhang
URL: https://wifs2022.utt.fr/

2023

JANUARY
IEEE Spoken Language Technology Workshop (SLT)
9–12 January, Doha, Qatar.
General Chairs: Ahmed Ali and Bhuvana Ramabhadran
URL: https://slt2022.org/

APRIL
International Symposium on Biomedical Imaging
18–21 April, Cartagena de Indias, Colombia.
General Chairs: Natasha Lepore and Oscar Acosta
URL: https://biomedicalimaging.org/2023/

JUNE
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
4–9 June, Rhodes Island, Greece.
General Chairs: Petros Maragos, Kostas Berberidis, and Petros Boufounos
URL: https://2023.ieeeicassp.org

2023 IEEE Conference on Artificial Intelligence (CAI)
7–8 June, Santa Clara, CA, USA.
General Chairs: Piero Bonissone and Gary Fogel
URL: https://cai.ieee.org/2023/

JULY
IEEE Statistical Signal Processing Workshop (SSP)
2–5 July, Hanoi, Vietnam.
General Chairs: Karim Abed-Meraim and Nguyen Linh Trung
URL: https://avitech.uet.vnu.edu.vn/ssp2023/

SEPTEMBER
31st European Signal Processing Conference (EUSIPCO)
4–8 September, Helsinki, Finland.
General Chairs: Esa Ollila and Sergiy A. Vorobyov
URL: http://eusipco2023.org/

SP
MATLAB SPEAKS
MACHINE
LEARNING
With MATLAB® you can use clustering,
regression, classification, and deep
learning to build predictive models
and put them into production.

mathworks.com/machinelearning

©2022 The MathWorks, Inc.
