ITU Artificial Intelligence/Machine Learning in 5G Challenge
Radio-Strike: A Reinforcement Learning
Game for MIMO Beam Selection in
Unreal Engine 3-D Environments
Aldebaro Klautau
Federal University of Pará (UFPA) / LASSE
http://ai5gchallenge.ufpa.br
July 02, 2021
Joint work with Prof. Francisco Müller (UFPA)
and several students
UFPA
Federal University of Pará
• Established in 1957
• Largest academic and research institution
in the Amazon (Pará state in Northern
Brazil)
• One of the largest Brazilian universities
with total population (students + staff) of
~60k people
• One of the missions is the sustainable
development of the region through
science and technology
3
Agenda
Motivation
Beam selection
Radio Strike
Reinforcement learning concepts (brief)
The ITU-ML5G-PS-006 reinforcement learning problem
4
Part I - Motivation
5
Machine learning for communications:
importance to industry
Standardization bodies discussing AI / ML:
• ITU: Architectural Framework for ML (Rec. Y.3172); network automation and resource adaptation (Rec. Y.3177)
• 3GPP: Network Data Analytics Function (NWDAF), TR 23.791; analytics in the 5G Core, TS 23.288
• ETSI: Experiential Networked Intelligence (ENI); Zero-Touch Network and Service Management (ZSM)
• Linux Foundation: AI in the Open Network Automation Platform (ONAP); ML and open data platforms
• O-RAN: RAN Intelligent Controller (RIC) and Near-RT RIC; data-driven workflows for closed control loops
Machine learning for communications
(ML4COMM) still faces the small data regime
7
[Plot: performance vs. amount of data. Deep learning keeps improving as data grows, while most other methods saturate; ML4COMM still sits in the small data regime]
Models are relatively small and
ML4COMM has yet to escape small
data regime
Data scarcity is an issue: in the traditional cycle, measurements are costly when high
frequencies and multiple antennas are used
For instance, reinforcement learning
(RL) agents applied to
communications typically have a
small action space dimension
Key alternative for speeding up ML4COMM:
use simulations to generate large datasets
8
https://www.itu.int/en/ITU-T/academia/kaleidoscope/2021/Pages/default.aspx
Virtual Reality in Real Time: FastNeRF
accelerates photorealistic 3D rendering via
Neural Radiance Fields (NeRF) to visualize
scenes at 200 frames per second
https://arxiv.org/abs/2103.10380v2
We will run simulations much faster than real-time
9
Historical evolution of neural networks applied to speech recognition (timeline: 1957-2012)
Algorithms:
• 1957 [Rosenblatt] “Perceptron”
• 1980 [Fukushima] “Convolutional” layers
• 1986 [Rumelhart et al] “Multilayer Perceptron” (MLP)
• 2006 [Hinton et al] Deep semi-supervised belief nets
Speech recognition:
• 1989 TI / MIT (TIMIT) dataset was released
• 1994 [Renals et al] HMM/MLP, 69 outputs, 1 hidden layer, 300k parameters
• 2011-2012 [Seide et al] SWBD-1 breakthrough: 9304 outputs, 7 hidden layers, 15M parameters
From (detailed) TIMIT to large (SWBD) datasets
10
[Plot: word error rate (WER) over time on the Switchboard-1 (SWBD-1) benchmark, from hidden Markov models (HMMs) with Gaussian mixtures (GMMs), to HMMs with deep convolutional nets, to LSTMs + tricks. The breakthrough came with the SWBD-1 dataset]
• TIMIT dataset has detailed time-aligned orthographic and phonetic transcriptions
• In 1986, it took 100 to 1000 hours of work to transcribe each hour of speech
• The project cost over 1 million dollars
• Five phoneticians agreed on 75% to 80% of cases
This marks when speech recognition reached the large data regime
Simulating communication systems + AI + VR / AR
6G systems are expected to support applications such as augmented
reality, multisensory communications and high fidelity holograms. This
information will flow through the network. It is expected that 6G
systems will use ML/AI to leverage such multimodal data and optimize
performance
11
This requires a simulation environment capable of generating not only communication
channels, but also the corresponding sensor data, matched to the scene
Generating MIMO Channels For 6G Virtual Worlds Using Ray-tracing Simulations https://arxiv.org/pdf/2106.05377.pdf
ITU-ML5G-PS-006-RL: Communication networks and Artificial
intelligence immersed in VIrtual or Augmented Reality (CAVIAR)
CAVIAR: get “measurements” on virtual worlds
12
Generating MIMO Channels For 6G Virtual Worlds Using Ray-tracing Simulations https://arxiv.org/pdf/2106.05377.pdf
Part II – Beam selection
13
Improving communications with antenna arrays
14
Wavelength λ = c/f (λ = 5 mm when f = 60 GHz)
Spacing between antenna elements = λ/2
The array form factor decreases when the frequency increases (mmWave in 5G / THz in 6G)
[Illustrative radiation patterns of an array with one antenna, 2 antennas and 36 antennas; more antennas produce a narrower beam]
Given a phased antenna array, we choose a
“beamvector” to impose a radiation pattern
Beam selection in 5G mmWave
[Block diagram: analog beamforming at the transmitter (baseband, DAC, RF chain, phase shifters specified by codebooks) and analog combining at the receiver (phase shifters, RF chain, ADC, baseband)]
Align the beams of the transmitter (Tx) and the receiver (Rx): the Tx uses a codebook with Mt beamforming vectors f, the Rx uses a codebook with Mr combining vectors w, and the wireless channel is H.
Brute force to find the best pair: try all possible Mt x Mr pairs of indices.
Goal: maximize y(i, j) = |w_j^H H f_i| over the beam pairs (i, j).
[1] Heath et al, An Overview of Signal Processing Techniques for Millimeter Wave MIMO Systems, 2016
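A minimal sketch of this exhaustive search with NumPy; the DFT-based codebooks and the array sizes below are illustrative assumptions, not the challenge configuration:

```python
import numpy as np

def best_beam_pair(H, F, W):
    """Exhaustive search: return the (i, j) maximizing y(i, j) = |w_j^H H f_i|.

    H: (Nr, Nt) channel matrix; F: (Nt, Mt) Tx codebook; W: (Nr, Mr) Rx codebook.
    """
    # y[i, j] = |w_j^H H f_i| for all Mt x Mr beam pairs
    y = np.abs(W.conj().T @ H @ F).T          # shape (Mt, Mr)
    i, j = np.unravel_index(np.argmax(y), y.shape)
    return i, j, y

# Illustrative example with a random channel and DFT codebooks (assumptions, not the challenge setup)
Nt, Nr, Mt, Mr = 8, 4, 8, 4
rng = np.random.default_rng(0)
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
F = np.fft.fft(np.eye(Nt))[:, :Mt] / np.sqrt(Nt)   # Tx DFT codebook
W = np.fft.fft(np.eye(Nr))[:, :Mr] / np.sqrt(Nr)   # Rx DFT codebook
i, j, y = best_beam_pair(H, F, W)
print(f"best Tx beam {i}, best Rx beam {j}, |y| = {y[i, j]:.3f}")
```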
ML-based beam selection in 5G mmWave:
often modeled as supervised learning
16
Example with two beamvectors per codebook: Tx beam index i ∈ {1, 2} and Rx beam index j ∈ {1, 2}. Each pair (i, j) can be mapped to a single index: (1, 1) → 0, (1, 2) → 1, (2, 1) → 2, (2, 2) → 3.
Inputs come from the communication system and also from sensors such as GPS.
Typically posed
as a classification
problem. We will
assume RL
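When posed as classification, the best beam pair is flattened into a single class label that the model predicts from the inputs. A sketch of that mapping (0-based indices are an assumption for illustration; the slide above uses 1-based beam indices):

```python
Mt, Mr = 2, 2  # two beamvectors per codebook, as in the example above

def pair_to_label(i, j, Mr=Mr):
    """Flatten the beam pair (i, j) into a single class index."""
    return i * Mr + j

def label_to_pair(label, Mr=Mr):
    """Recover the beam pair (i, j) from the class index."""
    return divmod(label, Mr)

# (0,0)->0, (0,1)->1, (1,0)->2, (1,1)->3
for i in range(Mt):
    for j in range(Mr):
        assert label_to_pair(pair_to_label(i, j)) == (i, j)
```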
Part III – Radio Strike
A Reinforcement Learning Game for MIMO Beam Selection in Unreal
Engine 3-D Environments
17
Reinforcement learning with OpenAI Gym
18
[Diagram: agent-environment loop. The agent observes the state (S), takes an action (A), and receives a reward (R) from the environment]
Goal: Find a policy that maximizes the return
over a lifetime (episode, if not a continuing task)
https://gym.openai.com/
https://www.slideshare.net/a4aleem/reinforcement-learning-using-openai-gym
We adopt the popular OpenAI Gym API
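A minimal sketch of the Gym interaction loop (classic 4-tuple step API used by Gym versions of that period). CartPole-v1 is only a stand-in here; the challenge ships its own environment:

```python
import gym

env = gym.make("CartPole-v1")      # stand-in; the challenge provides its own Gym environment
state = env.reset()
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()          # random policy; an RL agent would go here
    state, reward, done, info = env.step(action)
    episode_return += reward                    # return = sum of rewards over the episode
print("episode return:", episode_return)
```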
19
[Demo: agent behavior using random actions vs. after training the RL agent]
Similarly, we want to choose the beam
and maximize performance with
respect to throughput and packet loss
20
Problem: scheduling and beam selection in the downlink
[Illustration: the RL agent at the base station receives inputs such as position, speed, etc. of the users]
Reinforcement learning for beam selection:
RadioStrike-noRT-v1
21
The RL agent is executed at a base station (BS) with an antenna array and serves three single-antenna
users (a pedestrian, a drone and a car) on the downlink, using an analog MIMO architecture.
Action: at each time slot the agent schedules one user and chooses the beam index used to serve this user.
Reward: normalized throughput with a penalty for dropped packets
State (or observation): position
and buffer status of each user,
previously scheduled users, etc.
Return (at the end of the episode): sum of rewards
The state is defined by the participant, as well as any “intrinsic” rewards
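A hedged sketch of how such an action and reward could be encoded. The number of users, codebook size and penalty weight below are illustrative assumptions, not the official RadioStrike definitions:

```python
NUM_USERS = 3          # pedestrian, drone and car (assumption for illustration)
CODEBOOK_SIZE = 64     # number of beam indices (assumption)

def decode_action(action):
    """Map a flat discrete action into (scheduled user, beam index)."""
    user = action // CODEBOOK_SIZE
    beam = action % CODEBOOK_SIZE
    return user, beam

def reward(throughput, max_throughput, dropped_packets, penalty=1.0):
    """Normalized throughput with a penalty for dropped packets (illustrative shape)."""
    return throughput / max_throughput - penalty * dropped_packets

user, beam = decode_action(action=130)
print(user, beam)   # -> (2, 2) with the sizes assumed above
```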
ITU-ML5G-PS-006:
research questions and strategies
22
Some questions:
- When performing user scheduling and beam selection, does position information
help the scheduler?
- Can we benefit from knowing the positions of scatterers?
From experience with the 2020 Challenge:
- Help participants with the (potentially steep) learning curve
- Besides the main problem, discuss related simpler tasks and provide support
Keep evolving:
- Build together increasingly difficult CAVIAR “games”
- Create benchmarks for realistic applications of RL in 5G and 6G
Strategy 1: Provide guidance with the setup
Several specialized tools are needed, besides the ones for reinforcement learning:
• Most used language: Python, with Google’s TensorFlow (versions 1 and 2, with the high-level Keras API) and Facebook’s PyTorch
• Deployment frameworks facilitate pruning the models and quantizing the weights for acceleration: TensorFlow Lite, PyTorch Quantization, Qualcomm’s AI Model Efficiency Toolkit, and other tools from NVIDIA, Intel, etc.
• Auxiliary tools for (shallow) machine learning, debugging, assessing models and running on the cloud
It may not be trivial to set up your development workflow
23
Strategy 2: Share simple baseline code
24
Strategy 3: Postpone using ray-tracing and
adopt simple MIMO channel estimation
26
Future
environment
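As an illustration of what a simple substitute for ray tracing can look like, a narrowband geometric channel with a handful of paths is a common stand-in. This sketch is only an assumption about one possible simplification, not the model actually used in RadioStrike-noRT-v1:

```python
import numpy as np

def ula_response(num_antennas, angle_rad):
    """Array response of a half-wavelength-spaced uniform linear array."""
    n = np.arange(num_antennas)
    return np.exp(1j * np.pi * n * np.sin(angle_rad)) / np.sqrt(num_antennas)

def geometric_channel(nt, nr, num_paths=3, rng=np.random.default_rng()):
    """Sum of a few paths, each with a complex gain and departure/arrival angles."""
    H = np.zeros((nr, nt), dtype=complex)
    for _ in range(num_paths):
        gain = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        aod, aoa = rng.uniform(-np.pi / 2, np.pi / 2, size=2)
        H += gain * np.outer(ula_response(nr, aoa), ula_response(nt, aod).conj())
    return H * np.sqrt(nt * nr / num_paths)

H = geometric_channel(nt=64, nr=1)   # e.g., single-antenna user served by a 64-element array
```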
Strategy 4: Provide support to two beam selection environments
27
Both have the basic elements (base station, user, scatterer):
• MimoRL-simple-1-v0 (easier to start with)
• RadioStrike-noRT-v1 (PS-006 ITU Challenge)
ITU-ML5G-PS-006-RL: challenge, learning environment and
framework for building future CAVIAR simulations
Concepts of tabular reinforcement learning
28
Policy: what to do; it maps states into actions.
Strategy to obtain a policy: find the “value” Q(s, a) of each state/action pair, i.e., its long-term return, and store it in a Q-table [1].
Easier to visualize with a grid-world example (reach a pink corner).
[Figure: Q(s, a) values / Q-table for the OpenAI Gym environment MimoRL-simple-1-v0, with LOS and NLOS states]
Multi-armed bandits (MAB) are a simpler form of RL in which the action influences the reward but not the “state”.
[1] Sutton & Barto, Reinforcement Learning: An Introduction.
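A minimal tabular Q-learning sketch for a Gym environment with Discrete state and action spaces; the hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn a Q-table Q(s, a) for a Gym env with Discrete observation/action spaces."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            a = env.action_space.sample() if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # temporal-difference update towards r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```

The greedy policy is then read directly from the table: for each state, pick the action with the largest Q(s, a).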
29
Policy versus Q-value in a simpler 4 x 4 grid world
[Figure: Q-values for the optimal policy, and the optimal policy itself]
Goal: reach one of the pink corners. The reward is -1 everywhere.
The Q-value is the long-term expected return, not the immediate reward.
The policy can be based on the Q(s, a) table: learn the table first.
[1] Sutton & Barto, Reinforcement Learning: An Introduction (Example 4.1)
[2] https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter04/grid_world.py
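For this 4 x 4 grid the optimal values fit in a few lines. The Sutton & Barto example illustrates iterative policy evaluation; the sketch below instead runs value iteration on the same grid (an illustrative variant, not a transcription of reference [2]):

```python
import numpy as np

# 4 x 4 grid world: corners (0, 0) and (3, 3) are terminal, every move costs -1,
# and moves that would leave the grid keep the agent in place.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    if state in [(0, 0), (3, 3)]:
        return state, 0.0                       # terminal: no further cost
    r = (min(max(state[0] + action[0], 0), 3), min(max(state[1] + action[1], 0), 3))
    return r, -1.0

# Value iteration: V(s) = max_a [ reward + V(s') ]  (undiscounted, episodic)
V = np.zeros((4, 4))
for _ in range(100):
    for i in range(4):
        for j in range(4):
            V[i, j] = max(step((i, j), a)[1] + V[step((i, j), a)[0]] for a in ACTIONS)
print(V)   # optimal state values; the greedy policy w.r.t. V is the optimal policy
```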
DQN: From tabular methods to deep RL
30
Q-table: expected long-term return. The table can become too large! Then use a neural network instead [1].
[Diagram: NN with the state as input and Q-value estimates as outputs (linear activation)]
Another advantage of a NN over a table: the state space (input) can be continuous (real numbers).
Other characteristics of RL: online learning with no need for output labels; support for delayed rewards; one must find the balance between exploration and exploitation; reward engineering is needed.
The environment can be: probabilistic / deterministic; stationary / non-stationary; with full / partial state observability.
[1] Mnih et al, Playing Atari with Deep Reinforcement Learning, 2013
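A minimal sketch of such a Q-network using the Keras API (available in both TF 1.x and 2.x). Layer sizes are illustrative, and a full DQN would also need a replay buffer and a target network:

```python
import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS = 10, 4        # illustrative sizes

# Q-network: state in, one Q-value estimate per action out (linear activation)
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_ACTIONS, activation="linear"),
])
q_net.compile(optimizer="adam", loss="mse")    # regression towards TD targets

state = np.random.randn(1, STATE_DIM).astype(np.float32)
greedy_action = int(np.argmax(q_net.predict(state)))   # act greedily w.r.t. the Q estimates
```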
Another class of algorithms: policy gradient
31
Policy gradient methods: the NN output is a policy, not Q-value estimates. They support stochastic policies, and both the state space (input) and the action space (output) can be continuous (real numbers).
Discrete action example: [NN with the state as input and a softmax activation producing a distribution over actions]
Continuous action example: [NN with the state as input and activations for Gaussian means and variances]
Example: an RL agent that allocates power (as real numbers) in cell-free MIMO requires a continuous action space.
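A sketch of the two kinds of policy heads in Keras (sizes are illustrative): a softmax head giving a distribution over discrete actions, and a Gaussian head (means and log standard deviations) for continuous actions such as per-user powers:

```python
import tensorflow as tf

STATE_DIM, NUM_ACTIONS, NUM_POWERS = 10, 4, 8     # illustrative sizes

state_in = tf.keras.Input(shape=(STATE_DIM,))
hidden = tf.keras.layers.Dense(64, activation="relu")(state_in)

# Discrete actions: softmax gives a probability distribution over the action set
discrete_policy = tf.keras.layers.Dense(NUM_ACTIONS, activation="softmax")(hidden)

# Continuous actions (e.g., power allocation): Gaussian means and log std devs
means = tf.keras.layers.Dense(NUM_POWERS)(hidden)
log_stds = tf.keras.layers.Dense(NUM_POWERS)(hidden)

policy_net = tf.keras.Model(state_in, [discrete_policy, means, log_stds])
policy_net.summary()
```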
Summary of RL methods
32
• Tabular methods (no neural network)
• Deep Q-network (DQN): input is the state; outputs are Q-value estimates
• Policy gradient methods: input is the state; output is a distribution over actions (or Gaussian means/variances)
• Actor-critic (e.g., A3C): uses 2 NNs; the critic estimates Q-values and the actor estimates the policy
In all NN-based cases, the number of output neurons equals the number of actions. PS-006 has a small number of actions.
How is the ITU-ML5G-PS-006-RL simulation
performed?
33
[Scene: base station serving a drone]
Simulation block diagram
34
If one wants to avoid executing Unreal/Airsim
35
ITU-ML5G-PS-006-RL code and associated files
https://github.com/lasseufpa/ITU-Challenge-ML5G-PHY-RL
36
Steps to prepare the environment and run the baseline
Install package manager, e.g., Conda
Create environment
Activate the environment and install
the packages
We used Stable-Baselines 2.10 for our RL agent, so we needed
TensorFlow 1.14 and Python 3.6
Run RL agent train/test
Train: $ python3 train_agent.py 'agent_name' 'train_episode'
outputs: ./model/'agent_name'.a2c
Test: $ python3 test_agent.py 'agent_name' 'test_episode'
outputs: ./data/actions_'agent_name'.csv
37
Data organization
38
CSV text file corresponding to an episode:
Episode example (complete information about the scene)
This information does not depend on the RL agent actions and can be pre-computed.
The buffer status can be used as input to the agent but needs to be retrieved during execution
39
Sampling interval Ts = 10 milliseconds
Average episode duration = 3 minutes (i.e., on the order of 18,000 steps per episode)
…
Input data for baseline RL agent
40
Only data from users
(discard scatterers):
uav1, simulation_car2,
simulation_pedestrian4
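A hedged sketch of that filtering step with pandas; the file name and the "obj" column name are placeholders, since the actual CSV header is defined by the challenge files:

```python
import pandas as pd

USERS = ["uav1", "simulation_car2", "simulation_pedestrian4"]

episode = pd.read_csv("episode_0.csv")                 # hypothetical file name
# keep only the rows describing the served users, dropping scatterers
users_only = episode[episode["obj"].isin(USERS)]       # "obj" is a placeholder column name
print(users_only.head())
```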
Timeline
41
Thanks to all
ITU-ML5G-PS-006
reinforcement learning
team
42
Join the challenge!
43