
Machine Learning

B.Tech.

Nayan Ranjan Paul


Department of CSE
Silicon Institute of Technology
Syllabus

Topic                                                              Book/Chapter

Module I
Overview of supervised learning                                    T1/Ch. 2
K-nearest neighbour                                                T1/Ch. 2.3.2, R2/Ch. 8.2
Multiple linear regression                                         T1/Ch. 3.2.3
Shrinkage methods (Ridge regression, Lasso regression)             T1/Ch. 3.4
Logistic regression                                                T1/Ch. 4.3
Linear Discriminant Analysis                                       T1/Ch. 4.4
Feature selection                                                  T1/Ch. 5.3

Module II
Bias, Variance, and model complexity                               T1/Ch. 7.2
Bias-variance trade-off                                            T1/Ch. 7.3
Bayesian approach and BIC                                          T1/Ch. 7.7
Cross-validation                                                   T1/Ch. 7.10
Bootstrap methods                                                  T1/Ch. 7.11
Performance of classification algorithms (Confusion Matrix,
Precision, Recall and ROC Curve)                                   R7

Module III
Generative models for discrete data, Bayesian concept learning     R2/Ch. 6
Naive Bayes classifier                                             T1/Ch. 6.6.3, R2/Ch. 6
SVM for classification                                             T1/Ch. 12.3.1, R3/Ch. 1.5, T2/Ch. 6
Reproducing kernels                                                R5, R6, T2/Ch. 6
SVM for regression                                                 T1/Ch. 12.3.6, R3/Ch. 1.6, T2/Ch. 6
Regression and classification trees                                T1/Ch. 9.2.2, 9.2.3
Random forest                                                      T1/Ch. 15

Module IV
Clustering (K-means, spectral clustering)                          T1/Ch. 13.2.1
Feature extraction (Principal Component Analysis (PCA))            R1/Ch. 10.2
Kernel-based PCA                                                   T1/Ch. 14.5.4
Independent Component Analysis (ICA)                               R4/Ch. 12.6, CS229
Non-negative matrix factorization                                  T1/Ch. 14.6
Mixture of Gaussians                                               R4/Ch. 11.2.1, CS229
Expectation Maximization (EM) algorithm                            R4/Ch. 11.4, CS229

Module V
Boosting methods - exponential loss and AdaBoost                   T1/Ch. 10.4
Numerical optimization via gradient boosting                       T1/Ch. 10.10
Introduction to Reinforcement Learning                             T3/Ch. 18.1
Elements of Reinforcement Learning                                 T3/Ch. 18.3
Single State Case: K-Armed Bandit                                  T3/Ch. 18.2
Model-Based Learning (Value Iteration, Policy Iteration)           T3/Ch. 18.4

Books
T1: T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning - Data Mining, Inference and Prediction, 2nd Edition, Springer, 2009.
T2: S. Haykin, Neural Networks and Learning Machines, 3rd Edition, Pearson Education, 2009.
T3: E. Alpaydin, Introduction to Machine Learning, 2nd Edition, Prentice Hall of India, 2010.
R1: G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, 2nd Edition, Springer, 2013.
R2: T. M. Mitchell, Machine Learning, 1st Edition, McGraw-Hill Education, 2013.
R3: B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.
R4: K. Murphy, Machine Learning - A Probabilistic Perspective, MIT Press, 2012.
R5: N. Aronszajn, Theory of Reproducing Kernels, Transactions of the American Mathematical Society, 68 (1950): 337-404.
R6: S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman Scientific & Technical, 1988.
R7: https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html
Module - V
Boosting

Boosting is an ensemble approach.

What is Ensemble Classification?
– Combine the decisions of different weak classifiers
– Can be more accurate than the individual classifiers
– Generate a group of base-learners
– Different learners use different

Algorithms

Hyperparameters

Representations (Modalities)

Training sets
Boosting

Why should the ensemble approach work?
– It works well only if the individual classifiers disagree

Each classifier's error rate should be < 0.5 and the errors should be independent

The ensemble's error rate is highly correlated with the correlations of the errors made by
the different learners (see the sketch below)
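To make the "error rate below 0.5 and independent errors" condition concrete, here is a small illustrative computation (not from the slides): the majority vote of n hypothetical independent classifiers, each with error rate p, errs only when more than half of them err.

```python
# Illustrative only: majority-vote error of n independent classifiers with equal error rate p.
from math import comb

def majority_vote_error(n, p):
    """Probability that more than half of the n classifiers err (so the majority vote is wrong)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_vote_error(11, 0.4))   # ~0.25, well below the individual error of 0.4
print(majority_vote_error(11, 0.6))   # ~0.75, worse than individual: error must be < 0.5
```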
Boosting

Boosting is a general method for improving the accuracy of any given ML models.

Therefore, it is not a new model, but an add-on to the existing classification and
regression models.

Boosting is a sequential ensemble learning method that combines a set of weak
learners into a strong learner to minimize training errors.

Weak learner
– have low prediction accuracy, similar to random guessing.
– They are prone to overfitting

Strong learner
– Strong learners have higher prediction accuracy.
Benefits of Boosting

Ease of implementation
– Boosting uses easy-to-understand and easy-to-interpret algorithms that learn
from their mistakes. Little data preprocessing is required, and most languages have
built-in libraries that implement boosting algorithms, with many parameters for
fine-tuning performance.

Reduction of bias
– Boosting algorithms combine multiple weak learners sequentially, each iteration
improving on the previous one's errors. This approach helps to reduce the high
bias that is common in overly simple machine learning models.

Computational efficiency
– Boosting algorithms prioritize features that increase predictive accuracy during
training. They can help to reduce the number of data attributes and handle large
datasets efficiently.
Boosting

Consider three weak models having accuracy slightly above 50%

Model Accuracy
M1 60%
M2 62%
M3 55%


Boosting is a method of combining the weak models M1, M2, and M3 so as to yield a
higher-accuracy model (say, 80%).
Boosting

Consider a two-class problem with the output variable coded as Y ∈ {−1, +1}.


Given a data set D = {(xi, yi)}, i = 1, 2, ..., N, of training patterns,


A classifier G(x) produces a prediction taking one of the two values {−1, +1}.


Then the error rate on the training sample is

err = (1/N) Σi=1..N I(yi ≠ G(xi))

The purpose of boosting is to sequentially apply the weak classification algorithm
to repeatedly modified versions of the data, which produces a sequence of weak
classifiers Gm(x), m = 1, 2, ..., M.
Boosting

The predictions from all of them are then combined through a weighted majority
vote to produce the final prediction:

G(x) = sign( Σm=1..M αm Gm(x) )

where the weights α1, α2, ..., αM are computed by the boosting algorithm and give
higher influence to the more accurate classifiers in the sequence.


Boosting

The following figure shows a schematic of the AdaBoost procedure.
Algorithm AdaBoost
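The algorithm slides above are image-only in the original deck, so here is a minimal sketch of discrete AdaBoost (AdaBoost.M1) consistent with the description on the preceding slides. The use of scikit-learn decision stumps as weak learners, the function names, and the number of rounds M are illustrative assumptions, not the slides' own code.

```python
# A minimal sketch of AdaBoost.M1, assuming labels y in {-1, +1} and
# depth-1 decision trees ("stumps") as the weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Fit M weak learners; returns the stumps and their vote weights alpha_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # observation weights, start uniform
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)    # weighted training error of this learner
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)              # vote weight alpha_m
        w = w * np.exp(alpha * (pred != y))          # up-weight the misclassified points
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: G(x) = sign(sum_m alpha_m * G_m(x))."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```

The returned alphas are the vote weights αm, and adaboost_predict implements the weighted majority vote G(x) = sign(Σ αm Gm(x)) given earlier.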
Gradient Boosting
Minimization of Error in Gradient Boosting
Gradient Boosting Process
Gradient Boosting Algorithm
Example - Gradient Boosting
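The gradient-boosting derivation and worked example above are likewise on image-only slides. The sketch below illustrates the core idea for regression with squared-error loss: each stage fits a small tree to the current residuals (the negative gradient) and adds a shrunken version of it to the model. The learning rate, tree depth, and function names are assumptions for illustration, not the slides' exact example.

```python
# A hedged sketch of gradient boosting for regression with squared-error loss,
# using shallow regression trees as base learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lr=0.1):
    """Each stage fits a tree to the pseudo-residuals (negative gradient of squared loss)."""
    f0 = np.mean(y)                                # initial constant prediction
    trees = []
    resid = y - f0
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, resid)                         # fit the current residuals
        resid = resid - lr * tree.predict(X)       # residuals after adding the shrunken tree
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, lr=0.1):
    """Prediction is the initial constant plus the shrunken sum of all stage trees."""
    return f0 + lr * sum(t.predict(X) for t in trees)
```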
XGBOOST
Practice Questions-Boosting

What is Ensemble Classification?

What is Boosting?

How does a boosting algorithm help improve the performance of a set of
weak machine learning algorithms?

What is the need for boosting? State and explain the AdaBoost algorithm with a
schematic diagram.
Reinforcement Learning (Basic Idea)

Learn to take correct actions over time by experience

Similar to how humans learn: “trial and error”

Try an action –
– “see” what happens
– “remember” what happens
– Use this to change choice of action next time you are in the same situation
– “Converge” to learning correct actions

Focus on long term return, not immediate benefit
– Some actions may not seem beneficial in the short term, but their long-term return
will be good.
What is Reinforcement Learning ?

An approach to Artificial Intelligence

Learning from interaction

Goal-oriented learning

Learning about, from, and while interacting with an external environment

Learning what to do—how to map situations to actions—so as to maximize a
numerical reward signal

No detailed supervision is available

Sequence of actions required to obtain reward

Typically in a stochastic world.
What is Reinforcement Learning ?

Training of an ML model (an AI agent) to make a sequence of decisions in an uncertain,
potentially complex environment.

In reinforcement learning, the learner is a decision-making agent that takes actions
in an environment and receives reward (or penalty) for its actions in trying to solve
a problem.

After a set of trial-and-error runs, it should learn the best policy, which is the
sequence of actions that maximizes the total reward.
What is Reinforcement Learning ?

Example:

To build a machine that learns to play chess.

We cannot use a supervised learner, for two reasons.
– First, it is very costly to have a teacher take us through many games
and indicate the best move for each position.
– Second, in many cases, there is no such thing as the best move; the goodness
of a move depends on the moves that follow.

A single move does not count; a sequence of moves is good if after playing them
we win the game.

The only feedback is at the end of the game when we win or lose the game.
What is Reinforcement Learning ?

There is a decision maker, called the agent, that is placed in an environment (see
figure in slide 18).


In chess, the game-player is the decision maker(Agent) and the environment is the
board.


At any time, the environment is in a certain state that is one of a set of possible
states—for example, the state of the board in chess playing game.


The decision maker (agent) has a set of possible actions: for example, the legal
moves of the pieces on the board in the chess-playing game.
What is Reinforcement Learning ?

Once an action is chosen and taken, the state changes.


The solution to the task requires a sequence of actions, and we get feedback, in the
form of a reward, only rarely, generally only when the complete sequence is carried out.


The learning agent learns the best sequence of actions to solve a problem
– where “best” is quantified as the sequence of actions that has the maximum
cumulative reward.


Such is the setting of reinforcement learning.


It is learning to make good decisions under uncertainty.
Challenges in Reinforcement Learning

Setting up and simulating the environment


Scaling and tweaking the neural network that controls the agent


There is no mode of communication with the network other than rewards
and penalties.
Difference between ML, DL, and RL

ML is a form of AI in which computers are given the ability to
progressively (supervised or unsupervised) improve their performance on a specific
task using data.


DL models consist of multiple neural network layers that gradually learn more
abstract features from data.


RL employs a system of rewards and penalties and compels the agent to solve a
problem by itself.


Fundamental challenge in AI and ML is learning to make good decisions under
uncertainty.
Reinforcement Learning Involves

Optimization

Delayed consequences

Exploration

Generalization
Optimization

Goal is to find an optimal way to make decisions
– Yielding best outcomes or at least very good outcomes


Explicit notion of utility of decisions


Example: finding minimum distance route between two cities given network of
roads
Delayed Consequences

Decisions now can impact things much later...
– Saving for retirement
– Health insurance

Introduces two challenges
– When planning: decisions involve reasoning about not just immediate benefit
of a decision but also its longer term ramifications
– When learning: temporal credit assignment is hard (what caused later high or
low rewards?)
– Temporal credit assignment problem – How does the agent figure out the causal
relationship between the decisions it made in the past and the outcomes observed in
the future?
Exploration

Learning about the world by making decisions
– Agent as scientist
– Learn to ride a bike by trying (and failing)


Decisions impact what we learn about
– If you choose to go to IIT instead of SIT, you will have different later
experiences...
Generalization

A policy is a mapping from past experience to actions.


Why not just pre-program a policy?
RL vs Other AI and Machine Learning
RL vs Other AI and Machine Learning
RL vs Other AI and Machine Learning
RL vs Other AI and Machine Learning
RL vs Other AI and Machine Learning
Imitation Learning


Use experience to guide future decisions.

Sequential decision making under uncertainty
Modeling Real-World Problem

Let S : set of possible states of the model
st : the state at time t (so s1, s2, ... is the sequence of states over time)
A : set of possible actions

– P(st+1|st, at, st-1, at-1, ..., s1, a1) : transition dynamics


i.e. the probability of entering state st+1 given the current state st, the current action at,
and the previous states si and actions ai for i = t, t-1, ..., 1.
Markov Property

Consider a stochastic process (s0, s1, s2, . . .) evolving according to some transition
dynamics.

We say that the stochastic process has the Markov property if and only if
P(st+1|st, at, st-1, at-1 . . . , s1,a1) = P(st+1|st, at)

i.e. the transition probability of the next state conditioned on the history including
the current state and current action is equal to the transition probability of the next
state conditioned only on the current state and current action.

In such a scenario, the current state is a sufficient statistic of the history of the
stochastic process.

We say that the future is independent of the past given the present.

It is a memoryless property, i.e. the system only remembers the current state and action.
Example: Mars Rover Markov Decision Process

States S: Location of rover: {s1, s2, s3, ..., s7}

Actions A: TryLeft / TryRight

Rewards: +1 in state s1
+10 in state s7
0 in all other states
Components of RL Algorithm

Model

Policy

Value Function
Model

It is the mathematical description of the transitions and rewards of the agent’s
environment.


Agent’s representation of how world changes given agent’s action


Transition / dynamics model predicts next agent state
P(st+1 = s’|st = s, at = a)


Reward model predicts immediate reward
r(st = s, at = a) = E(rt|st = s, at = a)
Example: Mars Rover Stochastic Markov Model
Policy

It is a function π: S → A that maps the agent's states to actions, i.e. it determines which
action should be taken in state s.

– Deterministic policy : π(s) = a


– Stochastic policy : π(a|s) = P(at = a|st = s)
Example: Mars Rover Policy
Value Function

Value function Vπ : the expected discounted sum of future rewards under a particular
policy π

Vπ(s) = E[ rt + γ rt+1 + γ^2 rt+2 + ... | st = s, π ]

It is the expected cumulative (discounted) sum of future rewards obtained by the agent,
starting from state "s" and following policy "π".


γ : the discount factor


Can be used to quantify goodness/badness of states and actions
Example: Mars Rover Value Function
Markov Process

A Markov process is a stochastic process that satisfies the Markov property,
because of which we say that a Markov process is memoryless.


Assumptions:
– Finite state space : The state space of the Markov process is finite. This means
that for the Markov process (s0, s1, s2, . . .), there is a state space S with |S| < ∞
– Stationary transition probabilities : The transition probabilities are time
independent. Mathematically, this means the following:
P(si = s’|si-1 = s) = P(sj = s’|sj-1 = s) , ∀ s, s’ ∈ S , ∀ i, j = 1, 2, . . . .


A Markov process satisfying these assumptions is called a "Markov chain".
Markov Process

A Markov process is defined by the tuple (S, P) where
– S : a finite state space
– P : the transition probability P(st+1 = s'|st = s)
The matrix P is a non-negative row-stochastic matrix, i.e. the sum of each row
is 1
– If there is a finite number (N) of states, P can be expressed as an N × N matrix
(a toy construction is sketched below)
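As a toy illustration of expressing P as a matrix, the snippet below builds a hypothetical 7-state random-walk transition matrix for the Mars rover chain; the actual probabilities are in the slide's figure, which is not reproduced here, so the 0.5 values and the absorbing end states are assumptions.

```python
# Hypothetical transition matrix for a 7-state Mars rover chain: from each interior
# state the rover moves left or right with probability 0.5; s1 and s7 are absorbing.
import numpy as np

P = np.zeros((7, 7))
P[0, 0] = 1.0                    # s1 absorbing
P[6, 6] = 1.0                    # s7 absorbing
for s in range(1, 6):
    P[s, s - 1] = 0.5            # move one state to the left
    P[s, s + 1] = 0.5            # move one state to the right

assert np.allclose(P.sum(axis=1), 1.0)   # row-stochastic: every row sums to 1
```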
Example: Mars Rover Markov Chain Transition Matrix, P
MRP:Markov Reward Process

Example: Mars Rover Markov Chain Episodes
– Suppose the Mars rover is at s4
– Example: Sample episodes starting from s4
– s4,s5,s6,s7,s7,s7, . . .
– s4,s3,s2,s1,s2,s3,s4,s5 . . .
– s4,s4,s5,s6,s6 . . .
MRP:Markov Reward Process

Markov Reward Process is a Markov Chain + rewards


A Markov Reward Process (MRP) is defined as a 4-tuple (S, P, R, γ) where
– S is a (finite) set of states (s ∈ S)
– P is the dynamics/transition probability that specifies P(st+1 = s'|st = s)
– R is a reward function that maps states to rewards (real numbers),
i.e R : S → R
– Discount factor between 0 and 1 i.e. γ ∈ [0, 1]
MRP:Markov Reward Process

Reward function:
– In a Markov reward process, whenever a transition happens from a current
state s to a successor state s’ , a reward is obtained depending on the current
state s.

– Thus for the Markov process (s0, s1, s2, . . .),each transition si → si+1 is
accompanied by a reward ri for all i = 0, 1, . . .

– A particular episode of the Markov reward process is represented as (s0, r0, s1,
r1, s2, r2, . . .)
MRP:Markov Reward Process

Reward function:
– Rewards can be either deterministic or stochastic.

– In the deterministic case, mathematically this means that for all realizations of
the process we must have that: ri = rj , whenever si = sj ∀ i, j = 0, 1, . . . ,

– For a state s∈S, the expected reward R(s) = E[ri|si=s] ∀ i = 0, 1, . . .

– Example: +1 in s1
+10 in s7
0 in others
MRP:Markov Reward Process

Horizon(H) :
– The horizon H of a Markov reward process is defined as the number of time
steps in each episode (realization) of the process.
– The horizon can be finite or infinite.
– If the horizon is finite, then the process is also called a finite Markov reward
process.

Return (Gt) :
– The return Gt of a Markov reward process is defined as the discounted sum of
rewards starting at time t up to the horizon H
– It is given by the following formula:

Gt = rt + γ rt+1 + γ^2 rt+2 + ... + γ^(H-1-t) rH-1
MRP:Markov Reward Process

State value function (Vt(s)) :
– The state value function Vt(s) for a Markov reward process and a state s ∈ S is
defined as the expected return starting from state s at time t, and is given by the
following expression:

Vt(s) = E[ Gt | st = s ]


Discount factor(γ):
– if the horizon is infinite and γ = 1 ,then the return can become infinite even if
the rewards are all bounded.
– If this happens, then the value function V(s) can also become infinite.
– Such problems cannot then be solved using a computer.
– To avoid such mathematical difficulties and make the problems computationally
tractable we set γ < 1 .This quantity γ is called the discount factor .
MRP:Markov Reward Process

Discount factor(γ):
– Other than for purely computational reasons, it should be noted that humans
behave in much the same way -

we tend to put more importance on immediate rewards than on rewards
obtained at a later time.
– When γ = 0, we only care about the immediate reward
– When γ = 1, we put as much importance on future rewards as on the
present.
– If the horizon of the Markov reward process is finite, i.e. H < ∞, then we can set
γ = 1, as the returns and value functions are always finite.
MRP:Markov Reward Process

Example of a MRP: Mars Rover
– Given the set of states S = {S1, S2, S3, S4, S5, S6, S7}, discount factor γ = 0.5, rewards
r1 = 1, r2 = r3 = r4 = r5 = r6 = 0, r7 = 10, and horizon H = 4, compute the return G0 of
the following episodes.
– S4, S5, S6, S7, S7 : G0 = 0 + 0.5 * 0 + 0.5^2 * 0 + 0.5^3 * 10 = 1.25
– S4, S4, S5, S4, S5 : G0 = 0 + 0.5 * 0 + 0.5^2 * 0 + 0.5^3 * 0 = 0
– S4, S3, S2, S1, S2 : G0 = 0 + 0.5 * 0 + 0.5^2 * 0 + 0.5^3 * 1 = 0.125
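A quick check of these return computations (a small sketch; the reward lists are read off the per-state rewards stated above, truncated to the horizon H = 4):

```python
# Discounted return G_0 = sum_t gamma^t * r_t for a list of rewards.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 0, 10], 0.5))   # episode S4,S5,S6,S7,S7 -> 1.25
print(discounted_return([0, 0, 0, 0], 0.5))    # episode S4,S4,S5,S4,S5 -> 0.0
print(discounted_return([0, 0, 0, 1], 0.5))    # episode S4,S3,S2,S1,S2 -> 0.125
```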
How to compute the value function of a MRP?

There are three different ways to compute the value function of a Markov reward
process:
– Simulation
– Analytic solution
– Iterative solution
How to compute the value function of a MRP?

Monte Carlo simulation:
– Generate a large number (N) of episodes starting from state "s" at time "t".
– Compute the return Gt for each episode (discounting rewards by powers of γ).
– Estimate Vt(s) as the mean: Vt(s) ≈ Σ Gt / N
– For a Markov reward process M = (S, P, R, γ), state "s", time "t", and the number of
simulation episodes N, a sketch of the simulation algorithm is given below:
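The pseudo-code slide itself is not reproduced in this transcription; below is a hedged sketch of the Monte Carlo estimator, assuming P is a row-stochastic |S| × |S| matrix and R a length-|S| reward vector with states indexed 0..|S|-1.

```python
# Hedged sketch: Monte Carlo evaluation of V(s) for an MRP (S, P, R, gamma).
import numpy as np

def mc_value(P, R, gamma, s, H, N=10000, rng=None):
    """Estimate V(s) by averaging discounted returns of N simulated episodes of length H."""
    rng = np.random.default_rng() if rng is None else rng
    P, R = np.asarray(P), np.asarray(R)
    n_states = len(R)
    total = 0.0
    for _ in range(N):
        state, G, discount = s, 0.0, 1.0
        for _ in range(H):
            G += discount * R[state]                     # collect the current state's reward
            state = rng.choice(n_states, p=P[state])     # sample next state from P(.|state)
            discount *= gamma
        total += G
    return total / N
```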
How to compute the value function of a MRP?

Iterative solution for finite horizon:
– Dynamic programming based solution

– For M = (S, P, R, γ), start from V0(s) = 0 and apply the backward recursion
Vk(s) = R(s) + γ Σs' P(s'|s) Vk-1(s') for k = 1, ..., H (a sketch follows)
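A hedged sketch of this finite-horizon dynamic-programming evaluation, under the same matrix/vector representation of P and R as above:

```python
# Hedged sketch: finite-horizon value of an MRP by backward induction,
# V_0 = 0, then V_k = R + gamma * P V_{k-1} for k = 1..H.
import numpy as np

def finite_horizon_value(P, R, gamma, H):
    P, R = np.asarray(P, dtype=float), np.asarray(R, dtype=float)
    V = np.zeros_like(R)
    for _ in range(H):
        V = R + gamma * P @ V      # one sweep over all states costs O(|S|^2)
    return V
```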


How to compute the value function of a MRP?

Iterative solution for infinite horizon:
– The algorithm takes as input a Markov reward process M = (S, P, R, γ) , and a tolerance
ε, and computes the value function for all states.

– For both of these algorithms, the computational cost of each loop is O(|S|^2)
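A hedged sketch of the infinite-horizon iterative solution with tolerance ε:

```python
# Hedged sketch: repeat the update V <- R + gamma * P V until the largest
# change across states falls below the tolerance eps.
import numpy as np

def mrp_value_iteration(P, R, gamma, eps=1e-6):
    P, R = np.asarray(P, dtype=float), np.asarray(R, dtype=float)
    V = np.zeros_like(R)
    while True:
        V_new = R + gamma * P @ V                 # each sweep costs O(|S|^2)
        if np.max(np.abs(V_new - V)) < eps:       # stop once the values have converged
            return V_new
        V = V_new
```

For comparison, the analytic solution listed earlier is V = (I − γP)^(-1) R, which is well defined whenever γ < 1.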
Markov Decision Process(MDP)

It is formally represented using the tuple (S, A, P, R, γ), whose components are listed below:
– S : A finite state space.
– A : A finite set of actions which are available from each state
– P : A transition probability model that specifies P (s’|s, a) .
– R :A reward function that maps a state-action pair to rewards (real numbers),
i.e. R : S×A → R .
– γ : Discount factor γ ∈ [0, 1] .
Multi-Armed Bandit

It is a classic reinforcement learning problem which exemplifies the exploration-
exploitation trade-off dilemma.


K-armed bandit / N-armed bandit


It is a problem in which a limited set of resources must be allocated between
competing choices in a way that maximizes their expected gain, when each
choice's properties are only partially known at the time of allocation, and may
become better understood as time passes or as resources are allocated to the choices.


It is also categorized as stochastic scheduling.
Multi-Armed Bandit - Application

How must a given budget be distributed among research departments (whose
outcomes are only partially known) to maximize results?


How to design an effective financial portfolio?


How to design adaptive routing that minimizes end-to-end delay, delay jitter,
propagation delay, etc.?


How to effectively investigate clinical trials of different experimental
treatments while minimizing patient losses?


How to balance EXPLOITATION vs EXPLORATION tradeoff?
Modeling Multi-Armed Bandit Model (MAB)

It is defined by a set of real reward distributions:
– B = {R1, R2, ..., Rk}, one per lever (arm), k ∈ N+
– μ1, μ2, μ3, ..., μk : the mean values associated with these reward distributions.


The gambler iteratively plays one lever per round and observes the associated
reward.


Objective : Maximize the sum of the collected rewards.


Horizon(H): The number of rounds that remain to be played.
Modeling Multi-Armed Bandit Model (MAB)

MAB ≡ a one-state Markov Decision Process

Regret after T rounds: ρ = T μ* − Σt=1..T rt


where μ* = maxk μk is the maximal reward mean,

and rt is the reward obtained in round t.

Definition (Zero-Regret Strategy): a strategy whose average regret per round ρ/T
tends to zero as T → ∞, with probability 1.
Bandit Strategies

1. Optimal solutions:
UCBC [Upper Confidence Bounds with Clusters]
– The algorithm incorporates the clustering information by playing at two levels:


First, picking a cluster using a UCB-like strategy at each time step


Subsequently, picking an arm within the cluster, again using a UCB-like
strategy
Bandit Strategies

2. Approximate solutions:
– Epsilon-greedy strategy (see the sketch below)
– Epsilon-first strategy
– Epsilon-decreasing strategy
– Adaptive epsilon-greedy strategy based on value differences (VDBE)
– Adaptive epsilon-greedy strategy based on Bayesian ensembles (Epsilon-BMC)
– Contextual-epsilon-greedy strategy
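A minimal sketch of the basic epsilon-greedy strategy referenced in the list above; the pull(arm) helper, which returns a stochastic reward for the chosen arm, is a hypothetical stand-in for the environment.

```python
# Hedged sketch: epsilon-greedy for a k-armed bandit.
import numpy as np

def epsilon_greedy(pull, k, T, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(k)          # how often each arm was played
    values = np.zeros(k)          # running mean reward of each arm
    total = 0.0
    for t in range(T):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))        # explore: uniformly random arm
        else:
            arm = int(np.argmax(values))      # exploit: current best estimate
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update
        total += r
    return total, values
```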
Contextual Bandit

It is a generalization of the multi-armed bandit.

At each iteration, the agent has to choose between arms by referring to a d-
dimensional feature (context) vector.

Approximate solutions are
– Online linear Bandits

LinUCB (Linear Upper Confidence Bound)

LinRel (Linear Associative Reinforcement Learning)
– Online non-Linear Bandits

UCB...

Generalized Linear algorithm

Neural Bandit algorithm

Kernel UCB algorithm

Bandit forest algorithm

Oracle-based algorithm
Adversarial Bandit

Another variant of the multi-armed bandit problem is called the adversarial bandit.

Working Principle:

At each iteration, an agent chooses an arm and an adversary simultaneously
chooses the payoff structure for each arm.

Example: (iterated prisoner's dilemma)
– Each adversary has two arms to pull
– They can either deny or confess.
– Standard stochastic bandit algorithms do not work very well with these
iterations
Adversarial Bandit
– Example: suppose the opponent cooperates in the first 10 rounds, defects for the
next 20, then cooperates in the following 30, etc.

– Then algorithms such as Upper Confidence Bound (UCB) won't be able to

react very quickly to these changes.

– After a certain point, sub-optimal arms are rarely pulled, to limit exploration
and focus on exploitation.

– When the environment changes, the algorithm is unable to adapt, or may not
even detect the change.


Note: this is one of the strongest generalizations of the bandit problem, as it removes
all assumptions about the reward distributions, and a solution to the adversarial bandit
problem is a generalized solution to the more specific bandit problems.
Pseudocode: Approximate solution of the Adversarial Bandit (Exp3)

Input: γ ∈ (0, 1]

Initialization: wi(1) = 1 ∀ i = 1 to k

For each t = 1, 2, 3, ..., T
1. Set pi(t) = (1 − γ) wi(t) / Σj wj(t) + γ/k , for i = 1 to k
2. Draw "it" randomly according to the probabilities p1(t), p2(t), ..., pk(t)
3. Receive reward x_it(t) ∈ [0, 1]
4. For j = 1, 2, 3, ..., k
   x̂j(t) = xj(t)/pj(t) , if j = it
   x̂j(t) = 0 , otherwise
   wj(t+1) = wj(t) exp(γ x̂j(t)/k)
Pseudocode: Approximate solution of the Adversarial Bandit (Exp3)


Explanation:
– With probability (1 − γ), the algorithm chooses an arm in proportion to its weight,
preferring arms with higher weights - exploitation

– With probability γ, it chooses an arm uniformly at random - exploration
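A runnable, hedged version of the Exp3 pseudocode above, again using a hypothetical pull(arm) helper for the adversary's payoffs and assuming rewards in [0, 1]:

```python
# Hedged sketch of Exp3 (exponential-weight algorithm for exploration and exploitation).
import numpy as np

def exp3(pull, k, T, gamma=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    w = np.ones(k)                                   # initialization: w_i(1) = 1
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / k    # mix weights with uniform exploration
        arm = int(rng.choice(k, p=p))                # draw i_t according to p(t)
        x = pull(arm)                                # receive reward in [0, 1]
        total += x
        x_hat = x / p[arm]                           # importance-weighted reward estimate
        w[arm] *= np.exp(gamma * x_hat / k)          # exponential weight update for the pulled arm
    return total, w
```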


Follow the Perturbed Leader Algorithm (FPL)

Input: a noise (perturbation) distribution, e.g. exponential with rate η

Initialization: Ri(1) = 0 for i = 1 to k

For each t = 1, 2, ..., T
1. For each arm, generate random noise Zi(t) from the chosen (e.g. exponential) distribution

2. Pull arm I(t): I(t) = argmax_i { Ri(t) + Zi(t) }


i.e. add noise to each arm's value and pull the one with the highest perturbed value.
3. Update value: R_I(t)(t+1) = R_I(t)(t) + x_I(t)(t), leaving the other Ri unchanged.
4. The remaining quantities are computed as before.
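A hedged sketch of the FPL procedure as described above, assuming exponentially distributed noise with rate η and the same hypothetical pull(arm) helper:

```python
# Hedged sketch of Follow the Perturbed Leader (FPL) for a k-armed bandit.
import numpy as np

def fpl(pull, k, T, eta=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    R = np.zeros(k)                                  # cumulative observed reward per arm
    for t in range(T):
        noise = rng.exponential(1.0 / eta, size=k)   # perturb each arm's total
        arm = int(np.argmax(R + noise))              # follow the perturbed leader
        R[arm] += pull(arm)                          # update only the pulled arm
    return R
```

Note that this follows the slide's simple update of the pulled arm's cumulative reward; bandit-specific variants of FPL typically work with estimated rewards instead.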
Follow the Perturbed Leader Algorithm (FPL)

Working Principle:

Here the agent selects the "best performing arm" after adding exponential noise to each
arm's estimated value, i.e. the added noise is what drives the exploration.

Comparison: Exp3 vs FPL

Exp3                                               FPL (Follow the Perturbed Leader)

Maintains weights for each arm to                  Does not need to know the pulling
calculate the pulling probability                  probability for each arm

Has efficient theoretical guarantees               The standard FPL does not have good
                                                   theoretical guarantees

Computationally expensive (probability             Computationally efficient
calculation for each arm)

You might also like