RL Lecture1-Introduction (IITH)

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views

RL Lecture1-Introduction (IITH)

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

AI 3000 (CS 5500) : Reinforcement Learning

Easwar Subramanian
TCS Innovation Labs, Hyderabad

Email : [email protected]

August 03, 2024


Overview

1 Introduction

2 RL : Framework, Components and Challenges

3 Historical Notes

4 Motivation and Success Stories

5 Course Logistics

Easwar Subramanian, IIT Hyderabad


Introduction



Machine Learning
“Machine learning is about developing systems that have the ability to automatically learn and improve from experience without being explicitly programmed”

Figure Source: David Silver’s RL course
Supervised Learning
▶ Data : (x, y) → x is data and y is label
▶ Goal: Learn a function f to map y = f (x)
▶ Problems : Classification or Regression

Classification
Figure Source: Aura Portal - AI/ML Blog
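The supervised mapping y = f(x) can be made concrete with a tiny regression sketch. This is a minimal illustration in pure Python using closed-form ordinary least squares; the toy data and the true parameters (3.0, 0.5) are illustrative choices, not values from the lecture.

```python
# Toy supervised data: y = 3x + 0.5, generated without noise for clarity.
# The slope and intercept are illustrative, not from the lecture.
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 0.5 for x in xs]

# Learn f(x) = w*x + b via closed-form ordinary least squares
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - w * mx

print(w, b)  # recovers parameters close to the true (3.0, 0.5)
```

With labels available, learning reduces to fitting the map from x to y; classification only changes the form of y and the loss.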
Unsupervised Learning
▶ Data : (x) → Only data; No label
▶ Goal: Learn underlying structure
▶ Techniques : Clustering

Clustering
Figure Source: Aura Portal - AI/ML Blog
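Learning structure without labels can be sketched with a one-dimensional k-means toy (Lloyd's iterations). The data points and the initial centroids below are arbitrary choices for illustration:

```python
# Six unlabeled points forming two obvious 1-D clusters
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
c1, c2 = 0.0, 6.0  # arbitrary initial centroids

for _ in range(10):  # a few Lloyd iterations suffice on this toy set
    # Assign each point to its nearest centroid, then recompute centroids
    g1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
    g2 = [x for x in data if abs(x - c1) > abs(x - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)

print(c1, c2)  # centroids settle near 1.0 and 5.0
```

No label ever enters the loop; the grouping emerges from the data's own geometry.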
Reinforcement Learning
▶ Data : Agent interacts with environment to collect data
▶ Goal : Agent learns to interact with the environment to maximize a utility
▶ Examples : Learn a task, Navigation

Learn to cycle (task)


Figure Source: worldmodels.github.io
Example : Navigation

▶ Task : Start from square S and reach square G in as few moves as possible

▶ One has to make a sequence of moves (actions)
▶ The actions chosen determine which squares (states) are visited subsequently
▶ Reaching the goal state fetches a reward; visiting intermediate squares (states) may or may not fetch a reward
Navigation in grid world

Figure Source: Genevieve Hayes, Medium Post
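The navigation task above can be sketched as a minimal environment. The 4x4 layout and the reward scheme (only the goal pays +1, intermediate squares pay nothing) are assumptions for illustration, since the slide leaves intermediate rewards open:

```python
# A minimal 4x4 grid world: start S at (0, 0), goal G at (3, 3).
# Reaching G yields reward +1; all other moves pay 0 (an assumed scheme).
GOAL = (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply a move, clip at the walls, return (next_state, reward, done)."""
    dr, dc = MOVES[action]
    r = min(max(state[0] + dr, 0), 3)
    c = min(max(state[1] + dc, 0), 3)
    nxt = (r, c)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

# Six moves take the agent from S to G; the reward arrives only at the end
s, total, done = (0, 0), 0.0, False
for a in ["down", "down", "down", "right", "right", "right"]:
    s, rew, done = step(s, a)
    total += rew

print(s, total, done)  # (3, 3) 1.0 True
```

Note how the action chosen at each step determines which state is visited next, exactly as the bullets describe.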
Sequential Decision Making

Supervised or Unsupervised Setting


▶ The system makes an isolated decision, i.e., a classification, regression or clustering
▶ The decision does not affect future observations

Reinforcement Learning
▶ Generally, the agent makes a sequence of decisions (or actions)
▶ Actions affect future observations
▶ Actions taken have consequences



Types of Learning : Summary

Figure Source: Saggie


RL : Framework, Components and Challenges



Reinforcement Learning : Framework

▶ Observations are non-i.i.d. and sequential in nature


▶ Agent’s action (may) affect the subsequent observations seen
▶ There is no supervisor; only a reward signal (feedback)
▶ Reward or feedback can be delayed



Example : Tic-Tac-Toe

▶ Observations : Board position


▶ Actions : Moves
▶ Reward : Win or Loss



Example : Robotics

▶ Observations : Image from in-built camera


▶ Actions : Motor current for movement
▶ Reward : Task success measure



Example : Inventory Control

▶ Observations : Stock levels


▶ Actions : What to purchase
▶ Reward : Profit
Components of RL : Agent and Environment

Agent
▶ Executes action upon receiving observation
▶ For each action taken, the agent receives an appropriate reward

Environment
▶ An external system that an agent can perceive and act on.
▶ Receives action from agent and in response emits appropriate reward and (next)
observation

Slide Credit: David Silver’s RL Course
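The agent-environment interaction described above is commonly written as a loop. The environment, agent, and reward rule below are hypothetical stand-ins to show the shape of the loop, not an interface from the course:

```python
import random

class CoinEnv:
    """Environment: receives an action, emits reward and next observation."""
    def __init__(self):
        self.t = 0
    def interact(self, action):
        self.t += 1
        reward = 1.0 if action == "heads" else 0.0  # assumed reward rule
        observation = self.t                         # next observation
        return observation, reward

class RandomAgent:
    """Agent: executes an action upon receiving an observation."""
    def act(self, observation):
        return random.choice(["heads", "tails"])

env, agent = CoinEnv(), RandomAgent()
obs, total = 0, 0.0
for _ in range(100):
    action = agent.act(obs)             # agent acts on the observation
    obs, reward = env.interact(action)  # environment responds
    total += reward

print(total)  # cumulative reward collected over 100 interactions
```

Every RL algorithm in this course refines the agent side of this loop; the environment side stays a black box the agent can only probe through actions.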
Components of RL : State and Reward

State
▶ State can be viewed as a summary or an abstraction of the past history of the system
⋆ For example, in Tic-Tac-Toe, the state could be a raw image or a vector representation of the board

Reward
▶ Reward is a scalar feedback signal
▶ Indicates how well the agent acted at a certain time
▶ The agent’s aim is to maximise cumulative reward

Slide Credit: David Silver’s RL Course
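“Maximise cumulative reward” is usually formalised as the discounted return G = r_0 + γ r_1 + γ² r_2 + …. A minimal sketch, where γ = 0.9 is an arbitrary illustrative choice:

```python
# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# gamma = 0.9 is an arbitrary illustrative choice.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # fold from the back: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```

Folding from the back makes the recursion G_t = r_t + γG_{t+1} explicit; that same recursion underlies the Bellman equations seen later in the course.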
Reinforcement Learning : Challenges

▶ Delayed Feedback

▶ Credit Assignment Problem

▶ Stochastic Environment

▶ Definition of Reward Function

▶ Data Collection Problem



Historical Notes



Learning by Trial and Error

Tic-Tac-Toe

▶ Random moves by the agent are akin to exploration

▶ Exploration can help the agent place ’X’ in square number 5
▶ The reward obtained from placing ’X’ in square number 5 can now be remembered by updating the policy or value function



Thorndike’s Cat : Psychophysical Experiment

Thorndike’s cat

Law of Effect (1898)


Any behaviour that is followed by pleasant consequences is likely to be repeated, and any
behaviour followed by unpleasant consequences is likely to be stopped

Figure Source: Oscar Education Blogpost
Pavlov’s Dog

Pavlov’s Dog

Figure Source: https://www.age-of-the-sage.org/psychology/pavlov.html
Connections to Temporal Difference

▶ Ivan Pavlov laid the groundwork for classical conditioning (1901)


▶ First theory that incorporated time into the learning procedure
▶ Rescorla-Wagner (RW) (1972) model is a formal model to explain Pavlovian
conditioning
▶ Temporal-Difference (TD) learning, that extends RW model, is an approach to
learning how to predict a quantity that depends on future values of a given signal
(Sutton, 1984)
▶ TD learning forms the basis of almost all RL algorithms that we see today

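TD(0) prediction, the simplest instance of TD learning, moves a value estimate toward a bootstrapped target: V(s) ← V(s) + α(r + γV(s') − V(s)). A sketch on a toy two-state chain; the chain, α and γ are illustrative assumptions, not values from the lecture:

```python
import random

# TD(0) prediction on a toy chain A -> B -> terminal; transitions are
# deterministic and only the final reward is noisy (mean 1.0).
# alpha, gamma and the chain itself are illustrative choices.
alpha, gamma = 0.1, 1.0
V = {"A": 0.0, "B": 0.0, "end": 0.0}
random.seed(0)

for _ in range(2000):  # episodes
    for s, nxt in [("A", "B"), ("B", "end")]:
        r = random.gauss(1.0, 0.1) if nxt == "end" else 0.0
        # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[nxt] - V[s])

print(round(V["A"], 1), round(V["B"], 1))  # both estimates end up near 1.0
```

The update predicts a quantity that depends on future values of the signal, exactly the property that connects TD learning back to the Rescorla-Wagner model of conditioning.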


Connections to Optimal Control



Connections to Optimal Control

▶ Outcomes are partly random and partly under the control of the decision maker
▶ The Markov Decision Process (MDP) (Bellman, 1957) is used as a framework to model and solve sequential decision problems
▶ People working in control theory have contributed to optimal sequential decision
making



Modern Reinforcement Learning

▶ The temporal difference (TD) thread and the optimal control thread were brought together by Watkins (1989) when he proposed the famous Q-learning algorithm
▶ Gerald Tesauro (1992) employed TD learning to play backgammon; the resulting software agent was able to beat expert players

Figure Source: https://www.linuxjournal.com/article/11038
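Watkins' Q-learning update can be sketched on a toy chain MDP; the environment, actions and hyperparameters below are illustrative assumptions, not from the lecture:

```python
import random

# Q-learning on a tiny deterministic chain: states 0, 1, 2.
# "right" advances, "stay" stays; reaching state 2 ends the episode
# with reward +1. Hyperparameters are illustrative choices.
alpha, gamma, eps = 0.5, 0.9, 0.3
ACTIONS = ("right", "stay")
Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
random.seed(0)

def step(s, a):
    nxt = min(s + 1, 2) if a == "right" else s
    return nxt, (1.0 if nxt == 2 else 0.0), nxt == 2

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda x: Q[(s, x)])
        nxt, r, done = step(s, a)
        # Q-learning: bootstrap with the greedy (max) value of the next state
        best_next = 0.0 if done else max(Q[(nxt, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = nxt

greedy = max(ACTIONS, key=lambda x: Q[(0, x)])
print(greedy)  # right
```

The max over next-state actions is what makes Q-learning off-policy: it learns the greedy policy's values even while behaving epsilon-greedily.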
Era of Deep (Reinforcement) Learning

Deep Neural Net for Atari Games

Figure Source: Mnih et al., 2015


Reinforcement Learning : History

Slide Credit: RL Course, Abir Das


Motivation and Success Stories



Motivation

Why study Reinforcement Learning (RL) now?

▶ Advances in computational capability


▶ Advances in deep learning
▶ Advances in reinforcement learning
⋆ Subject matter of this course !

Slide Credit: Sergey Levine’s course on Deep RL at UC Berkeley
Success Stories

(a) Ng et al 2004 (b) Kohl et al 2004



Success Stories

(c) Mnih et al 2013 (d) Schulman et al 2016

(e) Silver et al 2016



Towards Intelligent Systems

▶ Things that we can all do (walking) (evolution, maybe)

▶ Things that we learn (riding a bicycle, driving a car, etc.)
▶ We learn a huge variety of things (music, sport, arts, etc.)
We are still far from building a ‘reasonably’ intelligent system
▶ We are taking baby steps towards the goal of building intelligent systems
▶ Reinforcement Learning (RL) is one of the important paradigms towards that goal

Slide Credit: Sergey Levine’s course on Deep RL at UC Berkeley
Course Logistics



Course Content - Part A
Modern Reinforcement Learning

▶ Markov Decision Process

▶ Dynamic Programming and Bellman Optimality Principle

▶ Value and Policy Iteration

▶ Convergence Properties of Value and Policy Iteration

▶ Model Free Prediction

▶ Model Free Control : Q-Learning and SARSA



Course Content - Part B
Deep Reinforcement Learning

▶ Deep Q-Learning and Variants

▶ Policy Gradient Approaches

▶ Variance Reduction in Policy Gradient Methods

▶ Actor Critic Algorithms

▶ Deterministic Policy Gradients

▶ Advanced Policy Gradient Methods : TRPO and PPO



Course Prerequisites

▶ Prerequisites
⋆ Probability
⋆ Linear Algebra
⋆ Machine Learning
⋆ Deep Learning

▶ Programming Prerequisites
⋆ Good Proficiency in Python
⋆ Tensorflow / Theano / PyTorch / Keras
⋆ Other Associated Python Libraries



Venue and Timing

▶ Mode
⋆ In class lectures at LHC-3 (possibly recorded for MDS students)
▶ Timing
⋆ Saturday - 10.00 AM to 1.00 PM (??)
▶ Course Co-ordinator
⋆ Prof. Konda Reddy



Course Evaluation

▶ Assignments : Three or Four in Total (30 %)

▶ Exams : Two in Total (70 %)

Details will be posted on Piazza



Course Material : Books

Reinforcement Learning : Sutton and Barto

Reinforcement Learning and Optimal Control, Bertsekas

Dynamic Programming and Optimal Control (I and II) by Bertsekas



Course Material : Online Material

David Silver’s course on Reinforcement Learning

UC Berkeley course on Deep RL (Sergey Levine)

Deep RL Bootcamp (Pieter Abbeel)

John Schulman’s lectures on Policy Gradient Methods

... and many others



Course Material : From India

Prof. B. Ravindran’s Course on RL (NPTEL)

Dr. Abir Das’s Course on RL (IIT KGP)

Reinforcement Learning via Stochastic Approximation, Mathukumalli Vidyasagar, Lecture Notes, 2022 (Link to online version available in Piazza)



Attribution and Disclaimer

▶ Most concepts, ideas and figures that form part of the course lectures are from several sources across the web; most of them are listed as course material

▶ Care is taken to provide appropriate attribution; omissions, if any, are regretted and unintentional

▶ Material prepared only for learning / teaching purpose

▶ Original authorship / copyright rests with the respective authors / publishers



Enjoy Learning!

