ITEC327-W11-Asynchronous

The document covers essential concepts in artificial intelligence and machine learning, focusing on autoencoders, generative adversarial networks (GANs), and reinforcement learning. It explains how autoencoders work for data representation and dimensionality reduction, introduces GANs as competing neural networks for generating data, and outlines the fundamentals of reinforcement learning through examples like autonomous driving and gaming. Additionally, it discusses training methods, including policy gradients and Q-learning, to optimize agent performance in various environments.


Essentials of Artificial Intelligence and Machine Learning
W11: Reinforcement Learning
Review of Last Week
• Data Representation
• Stacked (Deep) Autoencoders
• Generative Adversarial Networks (GANs)

Which face is real?



Patterns and Information
Noticing patterns helps to store information efficiently.

• Can you memorize the positions of all the pieces after looking at the board for 5 seconds?
• Chess experts can!



How an Autoencoder Works
• It looks at the inputs, converts them to an efficient latent representation, and outputs something that looks very close to the inputs.
• An autoencoder has two parts:
– an encoder: converts the inputs to a latent representation,
– a decoder: converts the latent representation back to the outputs
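
As a minimal sketch of these two parts (the Keras style and layer sizes are illustrative assumptions, not taken from the slides):

```python
from tensorflow import keras

# Encoder: converts the inputs to a small latent representation.
encoder = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[784])
])
# Decoder: converts the latent representation back to the outputs.
decoder = keras.models.Sequential([
    keras.layers.Dense(784, activation="sigmoid", input_shape=[30])
])
autoencoder = keras.models.Sequential([encoder, decoder])
```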



Where Autoencoders Are Used
1. Learn dense representations of the input data without supervision
2. For dimensionality reduction
• for visualization purposes
3. As feature detectors
• for unsupervised pretraining of DNNs
4. To generate new data that looks similar to the training data
• e.g. generate new faces



PCA with Linear Autoencoder
Task: use an autoencoder to perform PCA on a 3D dataset (projecting it onto 2D)
- check the ipynb file to see how this 3D dataset is generated



PCA with Linear Autoencoder
• Build and train the model:
– two subcomponents: encoder + decoder
– no. of outputs = no. of inputs
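
A minimal sketch of this linear autoencoder, assuming the 3D dataset from the ipynb file is available as X_train (no activation functions, so the network is purely linear, and the MSE loss makes it equivalent to PCA):

```python
from tensorflow import keras

encoder = keras.models.Sequential([keras.layers.Dense(2, input_shape=[3])])
decoder = keras.models.Sequential([keras.layers.Dense(3, input_shape=[2])])
autoencoder = keras.models.Sequential([encoder, decoder])  # 3 inputs, 3 outputs

autoencoder.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1.5))
history = autoencoder.fit(X_train, X_train, epochs=20)     # targets = inputs
codings = encoder.predict(X_train)                         # the 2D projection
```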



Implementing a Deep Autoencoder
• Two submodels: encoder + decoder
• Decoder takes codings of size 30
• Use binary cross-entropy loss
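
A sketch of such a stacked autoencoder, assuming 28×28 images (e.g. Fashion MNIST) scaled to [0, 1]; the layer sizes and SELU activations are illustrative assumptions:

```python
from tensorflow import keras

stacked_encoder = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),        # codings of size 30
])
stacked_decoder = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape=[30]),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28]),
])
stacked_ae = keras.models.Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(loss="binary_crossentropy", optimizer="nadam")
# stacked_ae.fit(X_train, X_train, epochs=20, validation_data=(X_valid, X_valid))
```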



Visualizing the Reconstructions
To ensure the autoencoder is properly trained:
• Compare the inputs and the outputs
• The differences should not be too significant

Original images from the validation set vs. their reconstructions



Autoencoder for Dimension Reduction
• Procedure:
1. the autoencoder reduces the dimensionality to 30
2. t-SNE then reduces the dimensionality to 2

Check the ipynb file for the visualization code.
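
A sketch of these two steps (assuming the stacked_encoder from the earlier sketch and a validation set X_valid):

```python
from sklearn.manifold import TSNE

X_valid_compressed = stacked_encoder.predict(X_valid)   # step 1: 30-D codings
tsne = TSNE(n_components=2)                             # step 2: down to 2-D
X_valid_2D = tsne.fit_transform(X_valid_compressed)
# X_valid_2D can now be plotted as a scatter, coloured by class label.
```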



What are GANs
• 2 neural networks compete against each other during training.
• This competition will push them to excel.

• Generator: generates data that looks similar to the training data (like a criminal trying to make realistic counterfeit money).
• Discriminator: tells real data from fake data (like the police investigator trying to tell real money from fake).
Training of GANs
• Phase 1: train the discriminator
– training data: real images + the same number of fake images produced by the generator
– only optimize the weights of the discriminator
• Phase 2: train the generator
1. use the generator to produce fake images
2. use the discriminator to tell which are fake
3. do not feed in any real images, but set all the labels to 1 (real image)
– only optimize the weights of the generator
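
A sketch of this two-phase training loop (assuming a compiled gan model whose two sub-layers are the generator and the discriminator, a dataset that yields batches of batch_size real images, and noise vectors of size codings_size):

```python
import tensorflow as tf

def train_gan(gan, dataset, batch_size, codings_size, n_epochs=50):
    generator, discriminator = gan.layers
    for epoch in range(n_epochs):
        for X_batch in dataset:
            # Phase 1: train the discriminator on fake + real images (labels 0/1).
            noise = tf.random.normal(shape=[batch_size, codings_size])
            generated_images = generator(noise)
            X_fake_and_real = tf.concat([generated_images, X_batch], axis=0)
            y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
            discriminator.trainable = True
            discriminator.train_on_batch(X_fake_and_real, y1)
            # Phase 2: train the generator through the gan model: feed noise only,
            # label everything as 1 (real), and freeze the discriminator's weights.
            noise = tf.random.normal(shape=[batch_size, codings_size])
            y2 = tf.constant([[1.]] * batch_size)
            discriminator.trainable = False
            gan.train_on_batch(noise, y2)
```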



Build a simple GAN
• Generator: similar to an autoencoder’s decoder
• Discriminator: a regular binary classifier
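
A minimal sketch of these two parts for 28×28 images (the layer sizes and activations are illustrative assumptions):

```python
from tensorflow import keras

codings_size = 30

generator = keras.models.Sequential([        # similar to an autoencoder's decoder
    keras.layers.Dense(100, activation="selu", input_shape=[codings_size]),
    keras.layers.Dense(150, activation="selu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28]),
])
discriminator = keras.models.Sequential([    # a regular binary classifier
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(150, activation="selu"),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
gan = keras.models.Sequential([generator, discriminator])
```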



Compile a simple GAN
• Generator: it is only trained through the gan model, so there is no need to compile it.
• Discriminator: a binary classifier, so use the binary cross-entropy loss.
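
A sketch of the compilation step, continuing the models above:

```python
discriminator.compile(loss="binary_crossentropy", optimizer="rmsprop")
discriminator.trainable = False    # frozen while the combined gan model is trained
gan.compile(loss="binary_crossentropy", optimizer="rmsprop")
# The generator is never compiled on its own: it is only trained through gan.
```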



Simple GANs vs. Deep Convolutional GANs

• Images generated by a simple GAN vs. images generated by a DCGAN



Build a Deep Convolutional GAN (DCGAN)

Guidelines for stable convolutional GANs:


• Replace any pooling layers with strided convolutions (in the
discriminator) and transposed convolutions (in the generator).
• Use Batch Normalization in both the generator and the discriminator,
except in the generator’s output layer and the discriminator’s input layer.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in the generator for all layers except the output
layer, which should use tanh.
• Use leaky ReLU activation in the discriminator for all layers.



Build a DCGAN
• Generator: use transposed convolutional layers
• Discriminator: a regular binary classifier (max pooling replaced by strided convolutions)
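
A sketch of a DCGAN for 28×28×1 images, following these guidelines (the textbook's notebook uses SELU rather than ReLU in the generator; layer sizes are illustrative, and the training images are assumed to be rescaled to [-1, 1] to match the tanh output):

```python
from tensorflow import keras

codings_size = 100

dcgan_generator = keras.models.Sequential([
    keras.layers.Dense(7 * 7 * 128, input_shape=[codings_size]),
    keras.layers.Reshape([7, 7, 128]),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(64, kernel_size=5, strides=2, padding="same",
                                 activation="selu"),      # transposed convolution
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(1, kernel_size=5, strides=2, padding="same",
                                 activation="tanh"),       # 28x28x1 output
])
dcgan_discriminator = keras.models.Sequential([
    keras.layers.Conv2D(64, kernel_size=5, strides=2, padding="same",
                        activation=keras.layers.LeakyReLU(0.2),
                        input_shape=[28, 28, 1]),           # strided conv, no pooling
    keras.layers.Dropout(0.4),
    keras.layers.Conv2D(128, kernel_size=5, strides=2, padding="same",
                        activation=keras.layers.LeakyReLU(0.2)),
    keras.layers.Dropout(0.4),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="sigmoid"),
])
dcgan = keras.models.Sequential([dcgan_generator, dcgan_discriminator])
```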



Images generated by a DCGAN
• Build the dataset, then compile and train this model with the exact same
code as the simple GAN. The generated images are:



This week’s content

Schedule
Week Workshops

1 Course Introduction. Recap of ML basics and project design.


2 Introduction to AI
3 Introduction to deep neural network
4 Training deep neural networks
5 Custom models & training/ Preprocessing data with TensorFlow
6 Deep neural network for computer vision (object detection)
7 Introduction to natural language processing
8 Microsoft Azure AI fundamentals - part 1
9 Microsoft Azure AI fundamentals - part 2
10 Autoencoders and GANs
11 Reinforcement learning
12 Course review. Q & A.



Agenda
• Introduction to Reinforcement Learning
• OpenAI Gym
• DQN & TF-Agents Library

OpenAI Gym: Cart-pole



AlphaGo versus Ke Jie
Google's AlphaGo AI defeats world Go number one Ke Jie



What is Reinforcement Learning (RL)
An example with Autonomous Driving:



Introduction

Key concepts in RL
agent – observations – actions – environment – rewards

• In RL, a software agent makes observations and takes actions within an environment, and in return it receives rewards.

• Objective: learn to act in a way that will maximize its expected rewards over time
• Positive rewards: pleasure
• Negative rewards: pain
• Maximize its pleasure and minimize its pain



Identify key concepts

Robotics
• agent: program controlling a robot
• environment: the real world
• observations: the agent observes the environment through a set
of sensors
• actions: sending signals to activate its motors
• positive rewards: whenever it approaches the target destination
• negative rewards: whenever it wastes time or goes in the wrong
direction.



Identify key concepts

Ms Pac-Man
• agent: the program controlling Ms. Pac-Man
• environment: a simulation of the Atari game
• actions: 9 possible joystick positions (upper left, down, centre,…)
• observations: screenshots,
• rewards: the game points.



Policy Search
• The algorithm a software agent uses to determine its actions is called its
policy.
• The policy could be a neural network taking observations as inputs and
outputting the action to take:

RL using a neural network policy



Example of a Policy

Policy for robotic vacuum cleaner


• Reward: the amount of dust it picks up in 30 minutes.
• Policy: every second,
– move forward with some probability p
– randomly rotate left or right with probability 1 - p (the rotation angle is a random angle in [-r, r])



Training of the robotic cleaner
Two policy parameters: probability p + angle range r
Method 1: brute-force approach
Difficult when the policy space is too large (which is generally the case)



Training of the robotic cleaner
Two policy parameters: probability p + angle range r
Method 2: policy gradients (PG)
1. Evaluate the gradients of the rewards with regard to the policy parameters
2. Tweak the parameters by following the gradients toward higher rewards

• slightly increase p and evaluate whether doing so increases the amount of dust picked up by the robot in 30 minutes
• if it does, then increase p some more, or else reduce p.



Agenda
• Introduction to Reinforcement Learning
• OpenAI Gym
• DQN & TF-Agents Library

OpenAI Gym: Cart-pole



CartPole - Reinforcement Learning Task
Hands-on task:
Train models to balance a pole on a moving cart.



Introduction to OpenAI Gym
• OpenAI Gym provides many environments for the agent to live in
• It is a great toolkit for developing and comparing RL algorithms.
• We create the CartPole environment with make()

Initialize the environment by calling its reset() method. This returns an observation.

obs: the observation (4 floats)
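
A minimal sketch of these calls (with the classic Gym API; in newer Gym/Gymnasium versions reset() returns an (obs, info) tuple):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()   # 4 floats: position, velocity, pole angle, angular velocity
```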



CartPole Environment
Observations:
1. the cart's horizontal position
2. its velocity
3. the angle of the pole (0 = vertical)
4. its angular velocity
Display this environment by calling render()



CartPole’s Actions
• Two actions are possible
0: accelerating left
1: accelerating right
• Since the pole is leaning toward the right (obs[2] > 0), let’s accelerate
the cart toward the right:
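
A sketch of that single step (with the classic 4-value Gym API; newer versions return five values, splitting done into terminated and truncated):

```python
action = 1                                   # 1 = accelerate right
obs, reward, done, info = env.step(action)
```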

After this step, the cart is moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.



CartPole’s Actions
• step() returns 4 values:
1. obs: the new observation
2. reward: we get a reward of 1 at every step, so the goal is to keep the episode running as long as possible
3. done: the value will be True when the episode is over. This happens when the pole tilts too much, when the cart goes off the screen, or after 200 steps
4. info: a dictionary with extra information

At the end of an episode (i.e., when step() returns done=True), you should reset the environment before you continue to use it.



A simple hard-coded policy
• A simple strategy: if the pole is tilting to the left, then push the cart to the
left, and vice versa.
• Will the cart-pole last 200 steps with this strategy?

We tried 500 episodes; the maximum number of steps reached was only 68.
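
A sketch of this hard-coded policy and the 500-episode evaluation (assuming the env from before and the classic 4-value Gym API; the exact maximum varies from run to run):

```python
def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1     # push left if the pole tilts left, else right

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        obs, reward, done, info = env.step(basic_policy(obs))
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

print(max(totals))    # in our run the best episode lasted only 68 steps
```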



A simple hard-coded policy
• Check the animation part in the ipynb file to see how it failed in one
episode:

The system is unstable and, after just a few wobbles, the pole ends up too tilted: game over.



Neural Network Policies
• It takes observations as inputs and outputs the probabilities of the actions to take for each observation.
• How to choose an action:
The network will estimate a probability for each action, then we will select an action randomly according to the estimated probabilities.



Neural Network Policies
• The Cart-Pole environment has two possible actions (left or right), so we only need one output neuron:
– it outputs the probability p of action 0 (left),
– the probability of action 1 (right) is then 1 - p.



Neural Network Policies
• Why pick a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score?

This approach lets the agent find the right balance between
• exploring new actions, and
• exploiting the actions that are known to work well.



Build this Neural Network Policy
• The number of inputs is the size of the observation space (which in the
case of Cart-Pole is 4),
• Only five hidden units because it’s a simple problem
• Output a single probability (the probability of going left), so we have a
single output neuron using the sigmoid activation function.
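
A sketch of this policy network (the ELU activation of the hidden layer is an assumption):

```python
from tensorflow import keras

n_inputs = 4    # size of CartPole's observation space

model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),   # probability of going left
])
```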



Neural Network without training
How well does this randomly initialized policy network perform?
• Pretty bad.
• The neural network will have to learn to do better.



How to train this neural network?
• If we knew what the best action was at
each step, we could train the NN as usual.
• However, in RL, we only know the rewards,
which are sparse and delayed

All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible.



Evaluating Actions
• Evaluate an action based on the sum of all the rewards that come after
it, usually applying a discount factor γ (gamma) at each step
• This sum of discounted rewards is called the action’s return.
• Discount factor for CartPole is 0.95 (actions have short-term effects)

• Example: an agent decides to go right 3 times in a row and gets a +10 reward after the first step, 0 after the second step, and finally -50 after the third step.
• Assuming a discount factor γ = 0.8, the first action will have a return of 10 + γ × 0 + γ² × (-50) = 10 + 0 - 0.8² × 50 = -22.



Evaluating Actions
• Action advantage:
Estimate how much better or worse an action is, compared to the other
possible actions, on average.

• Solution:
Run many episodes and normalize all the action returns (by subtracting
the mean and dividing by the standard deviation).
Then, actions with a negative advantage were bad while actions with a
positive advantage were good
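
A sketch of these two operations, computing each action's discounted return and then normalizing across many episodes (the helper names are ours, not from the slides):

```python
import numpy as np

def discount_rewards(rewards, discount_factor):
    # Work backwards: each step's return is its reward plus the
    # discounted return of the following step.
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_factor):
    all_discounted = [discount_rewards(r, discount_factor) for r in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(d - mean) / std for d in all_discounted]

# The slide's example: rewards [10, 0, -50] with gamma = 0.8
print(discount_rewards([10, 0, -50], 0.8))   # -> [-22., -40., -50.]
```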



Train agent using policy gradients
• PG algorithms optimize the parameters of a policy by following the gradients
toward higher rewards.
• REINFORCE algorithms are one class of PG algorithms (a code sketch follows these steps):
1. First, let the neural network policy play the game several times, and at each step,
compute the gradients that would make the chosen action even more likely—but
don’t apply these gradients yet.
2. Once you have run several episodes, compute each action’s advantage
3. If an action’s advantage is positive, it means that the action was probably good, and
you want to apply the gradients computed earlier to make the action even more likely
to be chosen in the future.
If the action’s advantage is negative, it means the action was probably bad, and you
want to apply the opposite gradients to make this action slightly less likely in the
future.
4. Finally, compute the mean of all the resulting gradient vectors, and use it to perform
a Gradient Descent step.
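
A sketch of this REINFORCE loop, assuming the policy model and CartPole env from before, the discount_and_normalize_rewards helper sketched earlier, and the classic 4-value Gym API; the iteration counts and learning rate are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

loss_fn = keras.losses.binary_crossentropy
optimizer = keras.optimizers.Adam(learning_rate=0.01)

def play_one_step(env, obs, model, loss_fn):
    # Step 1: compute the gradients that would make the chosen action
    # more likely, but do not apply them yet.
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)   # sample an action
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

for iteration in range(150):
    all_rewards, all_grads = [], []
    for episode in range(10):                                # play several episodes
        rewards, grads_seq = [], []
        obs = env.reset()
        for step in range(200):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            rewards.append(reward)
            grads_seq.append(grads)
            if done:
                break
        all_rewards.append(rewards)
        all_grads.append(grads_seq)
    # Steps 2-4: weight each step's gradients by its normalized advantage,
    # average them, and apply one Gradient Descent step.
    all_final_rewards = discount_and_normalize_rewards(all_rewards, 0.95)
    mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads.append(tf.reduce_mean(
            [final_reward * all_grads[ep_idx][step][var_index]
             for ep_idx, final_rewards in enumerate(all_final_rewards)
             for step, final_reward in enumerate(final_rewards)], axis=0))
    optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```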



Train agent using policy gradients
• Please open ipynb file to see the details of each step



The trained agent
The pole does not tilt too much, even after 200 steps.



Agenda
• Introduction to Reinforcement Learning
• OpenAI Gym
• DQN & TF-Agents Library



Q-Learning: a model-free RL algorithm
Thinking from another angle:
The cart-pole system can be in different states. Can we find the best action in each state?

• Luckily, we have an algorithm to estimate the optimal state-action values, generally called Q-Values (Quality Values).
• The optimal Q-Value of the state-action pair (s, a) is noted Q*(s, a).
• It is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses action a, but before it sees the outcome of this action, assuming it acts optimally after that action.
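
For reference (standard Q-Learning background, not shown on the slide), the update rule used to estimate these Q-Values from experience, with learning rate α and discount factor γ, is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$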



Deep Q-Learning
• Use a neural net that takes a state and outputs one approximate Q-
Value for each possible action.
• To solve the CartPole environment, a couple of hidden layers will do the job.
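
A sketch of such a Deep Q-Network for CartPole (layer sizes are illustrative):

```python
from tensorflow import keras

input_shape = [4]    # CartPole observations: 4 floats
n_outputs = 2        # two possible actions: accelerate left or right

model = keras.models.Sequential([
    keras.layers.Dense(32, activation="elu", input_shape=input_shape),
    keras.layers.Dense(32, activation="elu"),
    keras.layers.Dense(n_outputs),    # one approximate Q-Value per action
])
```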



Train agent with Deep Q-Network
• Please open ipynb file to see the details of each step
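
As one small piece of that training, an ε-greedy policy over the network's Q-Values might look like this sketch (the ε value and helper name are assumptions):

```python
import numpy as np

def epsilon_greedy_policy(state, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(n_outputs)        # explore: random action
    Q_values = model.predict(state[np.newaxis])    # exploit: best predicted Q-Value
    return int(np.argmax(Q_values[0]))
```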



Introduction to RL and Deep Q Networks
TF’s tutorial:



The TF-Agents Library

link



The TF-Agents Library
• The TF-Agents library is a Reinforcement Learning library based on
TensorFlow
• Like OpenAI Gym, it provides many off-the-shelf environments
(including wrappers for all OpenAI Gym environments)
• It also implements many RL algorithms, including REINFORCE, DQN,
and DDQN

Use TF-Agents to train an agent to play Breakout, the famous Atari game.



Reinforcement Learning in TensorFlow with
TF-Agents

Link:



DQN Tutorial with TF-Agents
TF’s tutorial on CartPole:



Online Activities



10-minute Quiz



Links for Week 11
• Google's AlphaGo AI defeats world Go number one Ke Jie
  https://www.theverge.com/2017/5/23/15679110/go-alphago-ke-jie-match-google-deepmind-ai-2017
• An introduction to Reinforcement Learning
  https://www.youtube.com/watch?v=JgvyzIkgxF0
• Reinforcement Learning
  https://github.com/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb
• Introduction to RL and Deep Q Networks
  https://www.tensorflow.org/agents/tutorials/0_intro_rl
• REINFORCE agent
  https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
• What is reinforcement learning?
  https://www.youtube.com/watch?v=TpMIssRdhco



Review

Self-test
• Key Concepts:
• Environment / agent / rewards / actions
• TF-Agents Library
• Policy search



Summary

After workshop
• Review today’s workshop, including slides, jupyter notebook, and
textbook (Chapter 17)
• Go through TF’s tutorials on Reinforcement Learning and DQN



Thank you!
Have a nice day!
Acknowledgements
The resources used in these slides were created from:
• referenced textbook resources, and
• web resources, including public images, scientific documents, course materials, and source code.
