RL Lecture 1 - Introduction (IITH)
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : [email protected]
Outline
▶ Introduction
▶ Historical Notes
▶ Course Logistics
Classification
Figure Source: Aura Portal - AI/ML Blog
Unsupervised Learning
▶ Data : (x) → Only data; No label
▶ Goal: Learn underlying structure
▶ Techniques : Clustering (a minimal sketch follows below)
Clustering
Figure Source: Aura Portal - AI/ML Blog
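To make the clustering idea concrete, here is a minimal sketch (not from the original slides) that fits k-means on made-up, unlabeled data using scikit-learn; the synthetic two-blob data, the seed, and the number of clusters are all assumptions for illustration.

```python
# A minimal clustering sketch on made-up data; two Gaussian blobs are
# generated without labels, and k-means recovers the two groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),   # blob around (0, 0)
                    rng.normal(3.0, 0.5, size=(50, 2))])  # blob around (3, 3)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # approximate centres of the two blobs
print(kmeans.labels_[:5])       # cluster assignment of the first few points
```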
Reinforcement Learning
▶ Data : Agent interacts with environment to collect data
▶ Goal : Agent learns to interact with environment to maximize a utility
▶ Examples : Learn a task, Navigation
▶ Task : Start from square S and reach square G in as few moves as possible
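A minimal sketch of such a navigation task is given below; the 4x4 grid, the start and goal squares, and the random policy are assumptions for illustration. The point is that an agent acting randomly needs far more moves than necessary, and learning is about closing that gap.

```python
# A toy version of the S-to-G gridworld task; grid size, start, and goal
# are assumed for illustration. A random (untrained) agent wanders a lot.
import random

SIZE = 4                      # assumed 4x4 grid
START, GOAL = (0, 0), (3, 3)  # S in one corner, G in the opposite corner
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action):
    """Apply a move, clipping at the grid border."""
    dr, dc = MOVES[action]
    r = min(max(pos[0] + dr, 0), SIZE - 1)
    c = min(max(pos[1] + dc, 0), SIZE - 1)
    return (r, c)

random.seed(0)
pos, n_moves = START, 0
while pos != GOAL:            # random policy; RL would learn to do better
    pos = step(pos, random.choice(list(MOVES)))
    n_moves += 1
print(f"Reached G in {n_moves} moves (optimal is 6).")
```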
Reinforcement Learning
▶ Generally, the agent makes a sequence of decisions (or actions)
▶ Actions affect future observations
▶ Actions taken have consequences
Agent
▶ Executes an action upon receiving an observation
▶ Receives an appropriate reward for the action taken
Environment
▶ An external system that an agent can perceive and act on.
▶ Receives an action from the agent and, in response, emits an appropriate reward and the (next) observation
State
▶ State can be viewed as a summary or an abstraction of the past history of the system
⋆ For example, in Tic-Tac-Toe, the state could be a raw image or a vector representation of the board
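As a small illustration, here is one possible vector encoding of a Tic-Tac-Toe board; the particular scheme (0 for empty, 1 for X, 2 for O, read row by row) is an assumption for the sketch, not necessarily the one used in the lectures.

```python
# One way to turn a Tic-Tac-Toe board into a compact state vector.
# Encoding is assumed: 0 = empty, 1 = X, 2 = O, read row by row.
board = [["X", "O", " "],
         [" ", "X", " "],
         [" ", " ", "O"]]

encoding = {" ": 0, "X": 1, "O": 2}
state = tuple(encoding[cell] for row in board for cell in row)
print(state)   # (1, 2, 0, 0, 1, 0, 0, 0, 2) -- a summary of play so far
```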
Reward
▶ Reward is a scalar feedback signal
▶ Indicates how well the agent acted at a given time
▶ The agent’s aim is to maximise cumulative reward
▶ Delayed feedback : the consequence of an action may only be seen much later
▶ Stochastic environment : the same action can lead to different outcomes
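To tie these definitions together, below is a minimal sketch of the agent-environment interaction loop; the toy environment, its reward rule, the discount factor gamma = 0.9, and the random policy are all assumptions for illustration. The agent accumulates the discounted sum of rewards r_0 + gamma*r_1 + gamma^2*r_2 + ..., which is the cumulative quantity it would try to maximise.

```python
# A minimal agent-environment loop; the environment dynamics, rewards,
# and discount factor below are illustrative assumptions.
import random

GAMMA = 0.9                   # discount factor (assumed)

def env_step(state, action):
    """Toy environment: reward +1 when the action matches the state's parity."""
    reward = 1.0 if action == state % 2 else 0.0
    next_state = random.randint(0, 9)            # stochastic transition
    return next_state, reward

def agent_act(observation):
    """A trivial (random) policy; learning would improve on this."""
    return random.choice([0, 1])

random.seed(0)
state, ret, discount = 0, 0.0, 1.0
for t in range(10):           # one short episode
    action = agent_act(state)                # agent acts on its observation
    state, reward = env_step(state, action)  # env emits reward + next obs
    ret += discount * reward                 # accumulate discounted reward
    discount *= GAMMA
print(f"Cumulative (discounted) reward: {ret:.3f}")
```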
Tic-Tac-Toe
▶ Outcomes are partly random and partly under the control of the decision maker
▶ Markov Decision Process (MDP) (Bellman, 1957) is used as a framework to model and solve sequential decision problems
▶ People working in control theory have contributed to optimal sequential decision
making
▶ The temporal difference (TD) thread and the optimal control thread were brought together by Watkins (1989) when he proposed the famous Q-learning algorithm (a sketch of its update rule follows this slide)
▶ Gerald Tesauro (1992) employed TD learning to play backgammon; the resulting software agent was able to beat expert players
Figure Source: https://www.linuxjournal.com/article/11038
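Since Q-learning is referred to above, here is a sketch of Watkins' tabular update rule, Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]; the learning rate, discount factor, action set, and the single toy transition are assumptions for illustration.

```python
# A sketch of Watkins' tabular Q-learning update; alpha, gamma, and the
# toy transition below are illustrative assumptions.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount (assumed)
ACTIONS = [0, 1]
Q = defaultdict(float)        # Q[(state, action)], initialised to 0

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

q_update(s=0, a=1, r=1.0, s_next=2)   # one hypothetical transition
print(Q[(0, 1)])                      # 0.1 after this single update
```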
Era of Deep (Reinforcement) Learning
▶ Prerequisites
⋆ Probability
⋆ Linear Algebra
⋆ Machine Learning
⋆ Deep Learning
▶ Programming Prerequisites
⋆ Good Proficiency in Python
⋆ TensorFlow / Theano / PyTorch / Keras
⋆ Other Associated Python Libraries
▶ Mode
⋆ In class lectures at LHC-3 (possibly recorded for MDS students)
▶ Timing
⋆ Saturday - 10.00 AM to 1.00 PM (??)
▶ Course Co-ordinator
⋆ Prof. Konda Reddy
▶ Most concepts, ideas, and figures that form part of the course lectures are drawn from several sources across the web; most of these are listed as course material
▶ Care has been taken to provide appropriate attribution; omissions, if any, are unintentional and regretted