ITU Artificial Intelligence/Machine Learning in 5G Challenge
Radio-Strike: A Reinforcement Learning
Game for MIMO Beam Selection in
Unreal Engine 3-D Environments
Aldebaro Klautau
Federal University of Pará (UFPA) / LASSE
http://ai5gchallenge.ufpa.br
July 02, 2021
Joint work with Prof. Francisco Müller (UFPA)
and several students
UFPA
Federal University of Pará
• Established in 1957
• Largest academic and research institution
in the Amazon (Pará state in Northern
Brazil)
• One of the largest Brazilian universities
with total population (students + staff) of
~60k people
• One of the missions is the sustainable
development of the region through
science and technology
3
Agenda
Motivation
Beam selection
Radio Strike
Reinforcement learning concepts (brief)
The ITU-ML5G-PS-006 reinforcement learning problem
4
Part I - Motivation
5
Machine learning for communications:
importance to industry
Standardization bodies discussing AI / ML:
• ITU: Architectural Framework for ML (Rec. Y.3172); network automation and resource adaptation (Rec. Y.3177)
• 3GPP: Network Data Analytics Function (NWDAF), TR 23.791; analytics in the 5G Core, TS 23.288
• ETSI: Experiential Networked Intelligence (ENI); Zero-Touch Network and Service Management (ZSM)
• Linux Foundation: AI in the Open Network Automation Platform (ONAP); ML and open data platforms
• O-RAN: RAN Intelligent Controller (RIC) and Near-RT RIC; data-driven workflows for closed control loops
Machine learning for communications
(ML4COMM) still faces the small data regime
7
[Plot: performance vs. amount of data. Deep learning keeps improving as data grows, while most other methods saturate; ML4COMM still sits in the small data regime]
Models are relatively small and
ML4COMM has yet to escape small
data regime
Data scarcity is an issue: in the traditional cycle, measurements are costly when high
frequencies and multiple antennas are used
For instance, reinforcement learning
(RL) agents applied to
communications typically have a
small action space dimension
Key alternative for speeding up ML4COMM:
use simulations to generate large datasets
8
https://www.itu.int/en/ITU-T/academia/kaleidoscope/2021/Pages/default.aspx
Virtual Reality in Real Time: FastNeRF
accelerates photorealistic 3D rendering via
Neural Radiance Fields (NeRF) to visualize
scenes at 200 frames per second
https://arxiv.org/abs/2103.10380v2
We will run simulations much faster than real-time
9
Historical evolution of neural networks applied to speech recognition (timeline: 1957-2012)
Algorithms:
• 1957 [Rosenblatt] “Perceptron”
• 1980 [Fukushima] “Convolutional” layers
• 1986 [Rumelhart et al] “Multilayer Perceptron” (MLP)
• 2006 [Hinton et al] Deep semi-supervised belief nets
Speech recognition:
• 1989 TI / MIT (TIMIT) dataset was released
• 1994 [Renals et al] HMM/MLP, 69 outputs, 1 hidden layer, 300k parameters
• 2011-2012 [Seide et al] SWBD-1 breakthrough: 9304 outputs, 7 hidden layers, 15M parameters
From (detailed) TIMIT to large (SWBD) datasets
10
[Plot: word error rate (WER) over time on the Switchboard-1 (SWBD-1) benchmark, from hidden Markov models (HMMs) with Gaussian mixtures (GMMs), to HMMs with deep convolutional nets, to LSTMs + tricks. The breakthrough came with the SWBD-1 dataset]
• TIMIT dataset has detailed time-aligned orthographic and phonetic transcriptions
• In 1986, it took 100 to 1000 hours of work to transcribe each hour of speech
• The project cost over 1 million dollars
• Five phoneticians agreed on 75% to 80% of cases
This marks when speech recognition reached the large data regime
Simulating communication systems + AI + VR / AR
6G systems are expected to support applications such as augmented
reality, multisensory communications and high fidelity holograms. This
information will flow through the network. It is expected that 6G
systems will use ML/AI to leverage such multimodal data and optimize
performance
11
This requires a simulation environment capable of generating not only communication
channels, but also the corresponding sensor data, matched to the scene
Generating MIMO Channels For 6G Virtual Worlds Using Ray-tracing Simulations https://arxiv.org/pdf/2106.05377.pdf
ITU-ML5G-PS-006-RL: Communication networks and Artificial
intelligence immersed in VIrtual or Augmented Reality (CAVIAR)
CAVIAR: get “measurements” on virtual worlds
12
Generating MIMO Channels For 6G Virtual Worlds Using Ray-tracing Simulations https://arxiv.org/pdf/2106.05377.pdf
Part II – Beam selection
13
Improving communications with antenna arrays
14
Wavelength λ = c/f (λ = 5 mm when f = 60 GHz)
Spacing between antenna elements = λ/2
The array form factor decreases when the frequency increases (mmWave in 5G / THz in 6G)
[Illustrative radiation patterns of an array with one antenna, 2 antennas and 36 antennas; more antennas produce a narrower beam]
Given a phased antenna array, we choose a
“beamvector” to impose a radiation pattern
Beam selection in 5G mmWave
[Block diagram: analog beamforming at the transmitter (baseband, DAC, RF chain, phase shifters specified by codebooks) and analog combining at the receiver (phase shifters, RF chain, ADC, baseband)]
Align the beams of the transmitter (Tx) and the receiver (Rx): the Tx uses a codebook with Mt beamforming vectors f, the Rx uses a codebook with Mr combining vectors w, and the wireless channel is H.
Brute force to find the best pair: try all possible Mt x Mr pairs of indices.
Goal: maximize y(i, j) = |w_j^H H f_i| over the beam pairs (i, j).
[1] Heath et al, An Overview of Signal Processing Techniques for Millimeter Wave MIMO Systems, 2016
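A minimal sketch of this exhaustive search with NumPy; the DFT-based codebooks and the array sizes below are illustrative assumptions, not the challenge configuration:

```python
import numpy as np

def best_beam_pair(H, F, W):
    """Exhaustive search: return the (i, j) maximizing y(i, j) = |w_j^H H f_i|.

    H: (Nr, Nt) channel matrix; F: (Nt, Mt) Tx codebook; W: (Nr, Mr) Rx codebook.
    """
    # y[i, j] = |w_j^H H f_i| for all Mt x Mr beam pairs
    y = np.abs(W.conj().T @ H @ F).T          # shape (Mt, Mr)
    i, j = np.unravel_index(np.argmax(y), y.shape)
    return i, j, y

# Illustrative example with a random channel and DFT codebooks (assumptions, not the challenge setup)
Nt, Nr, Mt, Mr = 8, 4, 8, 4
rng = np.random.default_rng(0)
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
F = np.fft.fft(np.eye(Nt))[:, :Mt] / np.sqrt(Nt)   # Tx DFT codebook
W = np.fft.fft(np.eye(Nr))[:, :Mr] / np.sqrt(Nr)   # Rx DFT codebook
i, j, y = best_beam_pair(H, F, W)
print(f"best Tx beam {i}, best Rx beam {j}, |y| = {y[i, j]:.3f}")
```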
ML-based beam selection in 5G mmWave:
often modeled as supervised learning
16
Example with two beamvectors per codebook: Tx beam index i ∈ {1, 2} and Rx beam index j ∈ {1, 2}. Each pair (i, j) can be mapped to a single index: (1, 1) → 0, (1, 2) → 1, (2, 1) → 2, (2, 2) → 3.
Inputs come from the communication system and also from sensors such as GPS.
Typically posed
as a classification
problem. We will
assume RL
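When posed as classification, the best beam pair is flattened into a single class label that the model predicts from the inputs. A sketch of that mapping (0-based indices are an assumption for illustration; the slide above uses 1-based beam indices):

```python
Mt, Mr = 2, 2  # two beamvectors per codebook, as in the example above

def pair_to_label(i, j, Mr=Mr):
    """Flatten the beam pair (i, j) into a single class index."""
    return i * Mr + j

def label_to_pair(label, Mr=Mr):
    """Recover the beam pair (i, j) from the class index."""
    return divmod(label, Mr)

# (0,0)->0, (0,1)->1, (1,0)->2, (1,1)->3
for i in range(Mt):
    for j in range(Mr):
        assert label_to_pair(pair_to_label(i, j)) == (i, j)
```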
Part III – Radio Strike
A Reinforcement Learning Game for MIMO Beam Selection in Unreal
Engine 3-D Environments
17
Reinforcement learning with OpenAI Gym
18
[Diagram: agent-environment loop. The agent observes the state (S), takes an action (A), and receives a reward (R) from the environment]
Goal: Find a policy that maximizes the return
over a lifetime (episode, if not a continuing task)
https://gym.openai.com/
https://www.slideshare.net/a4aleem/reinforcement-learning-using-openai-gym
We adopt the popular OpenAI Gym API
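A minimal sketch of the Gym interaction loop (classic 4-tuple step API used by Gym versions of that period). CartPole-v1 is only a stand-in here; the challenge ships its own environment:

```python
import gym

env = gym.make("CartPole-v1")      # stand-in; the challenge provides its own Gym environment
state = env.reset()
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()          # random policy; an RL agent would go here
    state, reward, done, info = env.step(action)
    episode_return += reward                    # return = sum of rewards over the episode
print("episode return:", episode_return)
```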
19
[Demo: agent behavior using random actions vs. after training the RL agent]
Similarly, we want to choose the beam
and maximize performance with
respect to throughput and packet loss
20
Problem: scheduling and beam selection in the downlink
[Illustration: the RL agent at the base station receives inputs such as position, speed, etc. of the users]
Reinforcement learning for beam selection:
RadioStrike-noRT-v1
21
The RL agent is executed at a base station (BS) with an antenna array and serves three single-antenna
users (a pedestrian, a drone and a car) on the downlink, using an analog MIMO architecture.
Action: at each time slot the agent schedules one user and chooses the beam index used to serve this user.
Reward: normalized throughput with a penalty for dropped packets
State (or observation): position
and buffer status of each user,
previously scheduled users, etc.
Return (at the end of the episode): sum of rewards
The state is defined by the participant, as well as any “intrinsic” rewards
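A hedged sketch of how such an action and reward could be encoded. The number of users, codebook size and penalty weight below are illustrative assumptions, not the official RadioStrike definitions:

```python
NUM_USERS = 3          # pedestrian, drone and car (assumption for illustration)
CODEBOOK_SIZE = 64     # number of beam indices (assumption)

def decode_action(action):
    """Map a flat discrete action into (scheduled user, beam index)."""
    user = action // CODEBOOK_SIZE
    beam = action % CODEBOOK_SIZE
    return user, beam

def reward(throughput, max_throughput, dropped_packets, penalty=1.0):
    """Normalized throughput with a penalty for dropped packets (illustrative shape)."""
    return throughput / max_throughput - penalty * dropped_packets

user, beam = decode_action(action=130)
print(user, beam)   # -> (2, 2) with the sizes assumed above
```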
ITU-ML5G-PS-006:
research questions and strategies
22
Some questions:
- When performing user scheduling and beam selection, does position information
help the scheduler?
- Can we benefit from knowing the positions of scatterers?
From experience with the 2020 Challenge:
- Help participants with the (potentially steep) learning curve
- Besides the main problem, discuss related simpler tasks and provide support
Keep evolving:
- Build together increasingly difficult CAVIAR “games”
- Create benchmarks for realistic applications of RL in 5G and 6G
Strategy 1: Provide guidance with the setup
Several specialized tools are needed, besides the ones for reinforcement learning:
• Most used language: Python, with Google’s TensorFlow (versions 1 and 2, with the high-level Keras API) and Facebook’s PyTorch
• Deployment frameworks facilitate pruning the models and quantizing the weights for acceleration: TensorFlow Lite, PyTorch Quantization, Qualcomm’s AI Model Efficiency Toolkit, and other tools from NVIDIA, Intel, etc.
• Auxiliary tools for (shallow) machine learning, debugging, assessing models and running on the cloud
It may not be trivial to set up your development workflow
23
Strategy 2: Share simple baseline code
24
Strategy 3: Postpone using ray-tracing and
adopt simple MIMO channel estimation
26
Future
environment
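As an illustration of what a simple substitute for ray tracing can look like, a narrowband geometric channel with a handful of paths is a common stand-in. This sketch is only an assumption about one possible simplification, not the model actually used in RadioStrike-noRT-v1:

```python
import numpy as np

def ula_response(num_antennas, angle_rad):
    """Array response of a half-wavelength-spaced uniform linear array."""
    n = np.arange(num_antennas)
    return np.exp(1j * np.pi * n * np.sin(angle_rad)) / np.sqrt(num_antennas)

def geometric_channel(nt, nr, num_paths=3, rng=np.random.default_rng()):
    """Sum of a few paths, each with a complex gain and departure/arrival angles."""
    H = np.zeros((nr, nt), dtype=complex)
    for _ in range(num_paths):
        gain = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        aod, aoa = rng.uniform(-np.pi / 2, np.pi / 2, size=2)
        H += gain * np.outer(ula_response(nr, aoa), ula_response(nt, aod).conj())
    return H * np.sqrt(nt * nr / num_paths)

H = geometric_channel(nt=64, nr=1)   # e.g., single-antenna user served by a 64-element array
```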
Strategy 4: Provide support to two beam selection environments
27
Both have the basic elements (base station, user, scatterer):
• MimoRL-simple-1-v0 (easier to start with)
• RadioStrike-noRT-v1 (PS-006 ITU Challenge)
ITU-ML5G-PS-006-RL: challenge, learning environment and
framework for building future CAVIAR simulations
Concepts of tabular reinforcement learning
28
Policy: what to do; it maps states into actions.
Strategy to obtain a policy: find the “value” Q(s, a) of each state/action pair, i.e., its long-term return, and store it in a Q-table [1].
Easier to visualize with a grid-world example (reach a pink corner).
[Figure: Q(s, a) values / Q-table for the OpenAI Gym environment MimoRL-simple-1-v0, with LOS and NLOS states]
Multi-armed bandits (MAB) are a simpler form of RL in which the action influences the reward but not the “state”.
[1] Sutton & Barto, Reinforcement Learning: An Introduction.
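A minimal tabular Q-learning sketch for a Gym environment with Discrete state and action spaces; the hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn a Q-table Q(s, a) for a Gym env with Discrete observation/action spaces."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            a = env.action_space.sample() if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # temporal-difference update towards r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```

The greedy policy is then read directly from the table: for each state, pick the action with the largest Q(s, a).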
29
Policy versus Q-value in a simpler 4 x 4 grid world
[Figure: Q-values for the optimal policy, and the optimal policy itself]
Goal: reach one of the pink corners. The reward is -1 everywhere.
The Q-value is the long-term expected return, not the immediate reward.
The policy can be based on the Q(s, a) table: learn the table first.
[1] Sutton & Barto, Reinforcement Learning: An Introduction (Example 4.1)
[2] https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter04/grid_world.py
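For this 4 x 4 grid the optimal values fit in a few lines. The Sutton & Barto example illustrates iterative policy evaluation; the sketch below instead runs value iteration on the same grid (an illustrative variant, not a transcription of reference [2]):

```python
import numpy as np

# 4 x 4 grid world: corners (0, 0) and (3, 3) are terminal, every move costs -1,
# and moves that would leave the grid keep the agent in place.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    if state in [(0, 0), (3, 3)]:
        return state, 0.0                       # terminal: no further cost
    r = (min(max(state[0] + action[0], 0), 3), min(max(state[1] + action[1], 0), 3))
    return r, -1.0

# Value iteration: V(s) = max_a [ reward + V(s') ]  (undiscounted, episodic)
V = np.zeros((4, 4))
for _ in range(100):
    for i in range(4):
        for j in range(4):
            V[i, j] = max(step((i, j), a)[1] + V[step((i, j), a)[0]] for a in ACTIONS)
print(V)   # optimal state values; the greedy policy w.r.t. V is the optimal policy
```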
DQN: From tabular methods to deep RL
30
Q-table: expected long-term return. The table can become too large! Then use a neural network instead [1].
[Diagram: NN with the state as input and Q-value estimates as outputs (linear activation)]
Another advantage of a NN over a table: the state space (input) can be continuous (real numbers).
Other characteristics of RL: online learning with no need for output labels; support for delayed rewards; one must find the balance between exploration and exploitation; reward engineering is needed.
The environment can be: probabilistic / deterministic; stationary / non-stationary; with full / partial state observability.
[1] Mnih et al, Playing Atari with Deep Reinforcement Learning, 2013
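A minimal sketch of such a Q-network using the Keras API (available in both TF 1.x and 2.x). Layer sizes are illustrative, and a full DQN would also need a replay buffer and a target network:

```python
import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS = 10, 4        # illustrative sizes

# Q-network: state in, one Q-value estimate per action out (linear activation)
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_ACTIONS, activation="linear"),
])
q_net.compile(optimizer="adam", loss="mse")    # regression towards TD targets

state = np.random.randn(1, STATE_DIM).astype(np.float32)
greedy_action = int(np.argmax(q_net.predict(state)))   # act greedily w.r.t. the Q estimates
```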
Another class of algorithms: policy gradient
31
Policy gradient methods: the NN output is a policy, not Q-value estimates. They support stochastic policies, and both the state space (input) and the action space (output) can be continuous (real numbers).
Discrete action example: [NN with the state as input and a softmax activation producing a distribution over actions]
Continuous action example: [NN with the state as input and activations for Gaussian means and variances]
Example: an RL agent that allocates power (as real numbers) in cell-free MIMO requires a continuous action space.
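A sketch of the two kinds of policy heads in Keras (sizes are illustrative): a softmax head giving a distribution over discrete actions, and a Gaussian head (means and log standard deviations) for continuous actions such as per-user powers:

```python
import tensorflow as tf

STATE_DIM, NUM_ACTIONS, NUM_POWERS = 10, 4, 8     # illustrative sizes

state_in = tf.keras.Input(shape=(STATE_DIM,))
hidden = tf.keras.layers.Dense(64, activation="relu")(state_in)

# Discrete actions: softmax gives a probability distribution over the action set
discrete_policy = tf.keras.layers.Dense(NUM_ACTIONS, activation="softmax")(hidden)

# Continuous actions (e.g., power allocation): Gaussian means and log std devs
means = tf.keras.layers.Dense(NUM_POWERS)(hidden)
log_stds = tf.keras.layers.Dense(NUM_POWERS)(hidden)

policy_net = tf.keras.Model(state_in, [discrete_policy, means, log_stds])
policy_net.summary()
```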
Summary of RL methods
32
• Tabular methods (no neural network)
• Deep Q-network (DQN): input is the state; outputs are Q-value estimates
• Policy gradient methods: input is the state; output is a distribution over actions (or Gaussian means/variances)
• Actor-critic (e.g., A3C): uses 2 NNs; the critic estimates Q-values and the actor estimates the policy
In all NN-based cases, the number of output neurons equals the number of actions. PS-006 has a small number of actions.
How is the ITU-ML5G-PS-006-RL simulation
performed?
33
[Scene: base station serving a drone]
Simulation block diagram
34
If one wants to avoid executing Unreal/Airsim
35
ITU-ML5G-PS-006-RL code and associated files
https://github.com/lasseufpa/ITU-Challenge-ML5G-PHY-RL
36
Steps to prepare the environment and run the baseline
Install package manager, e.g., Conda
Create environment
Activate the environment and install
the packages
We used Stable-Baselines 2.10 for our RL agent, so we needed
TensorFlow 1.14 and Python 3.6
Run RL agent train/test
Train: $ python3 train_agent.py 'agent_name' 'train_episode'
outputs: ./model/'agent_name'.a2c
Test: $ python3 test_agent.py 'agent_name' 'test_episode'
outputs: ./data/actions_'agent_name'.csv
37
Data organization
38
CSV text file corresponding to an episode:
Episode example (complete information about the scene)
This information does not depend on the RL agent actions and can be pre-computed.
The buffer status can be used as input to the agent but needs to be retrieved during execution
39
Sampling interval Ts = 10 milliseconds
Average episode duration = 3 minutes (i.e., on the order of 18,000 steps per episode)
…
Input data for baseline RL agent
40
Only data from users
(discard scatterers):
uav1, simulation_car2,
simulation_pedestrian4
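A hedged sketch of that filtering step with pandas; the file name and the "obj" column name are placeholders, since the actual CSV header is defined by the challenge files:

```python
import pandas as pd

USERS = ["uav1", "simulation_car2", "simulation_pedestrian4"]

episode = pd.read_csv("episode_0.csv")                 # hypothetical file name
# keep only the rows describing the served users, dropping scatterers
users_only = episode[episode["obj"].isin(USERS)]       # "obj" is a placeholder column name
print(users_only.head())
```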
Timeline
41
Thanks to all
ITU-ML5G-PS-006
reinforcement learning
team
42
Join the challenge!
43