16 Boosting
Ensemble methods are used to form combinations of predictors that achieve a bias-variance trade-
off better than the one achieved by the algorithm generating the predictors in the combination.
Many stochastic gradient descent algorithms, like Pegasos, can be viewed as ensemble methods
because they output a combination of all the predictors generated during the sequential run over
the training set. We now look at ensemble methods that are not based on online algorithms. As
usual, our focus is on binary classification.
Fix a training set (x1 , y1 ), . . . , (xm , ym ) for a binary classification problem with zero-one loss, and
assume an ensemble of classifiers h1 , . . . , hT is available (later, we describe how to obtain the
ensemble). Consider the majority classifier f defined by
\[ f(x) = \mathrm{sgn}\Bigg(\sum_{i=1}^T h_i(x)\Bigg) \tag{1} \]
Clearly, f is wrong on x if and only if at least half of the classifiers h1 , . . . , hT are wrong on x
(assume T is odd to avoid ties). We now study the conditions under which the majority classifier
achieves a small training error. Assume
\[ \mathbb{P}\big(h_1(x_Z) \neq y_Z \wedge \cdots \wedge h_T(x_Z) \neq y_Z\big) = \prod_{i=1}^T \mathbb{P}\big(h_i(x_Z) \neq y_Z\big) \tag{2} \]
where Z is a random variable uniformly distributed on the set {1, . . . , m} of indices of training
examples. In other words, each classifier is wrong independently of the others with respect to
the uniform distribution over the training set. The indicator functions of these events define the
training error `S (hi ) of each classifier hi . Indeed,
\[ \ell_S(h_i) = \frac{1}{m}\sum_{t=1}^m I\{h_i(x_t) \neq y_t\} = \mathbb{P}\big(h_i(x_Z) \neq y_Z\big) \]
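To fix ideas, here is a minimal numpy sketch (the function names are ours, purely illustrative) of the majority classifier (1) and of the training error computed as an average of indicators:

import numpy as np

def majority_vote(predictions):
    """predictions: (T, m) array with entries in {-1, +1}, one row per classifier."""
    # sgn of the sum of the T votes; with T odd the sum is never zero
    return np.sign(predictions.sum(axis=0))

def training_error(y_pred, y_true):
    """Fraction of training points on which y_pred and y_true disagree."""
    return np.mean(y_pred != y_true)

# Example with T = 3 hypothetical classifiers on m = 4 points
preds = np.array([[+1, -1, +1, +1],
                  [+1, +1, -1, +1],
                  [-1, +1, +1, -1]])
y = np.array([+1, +1, +1, -1])
print(training_error(majority_vote(preds), y))  # training error of the majority classifier f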
The majority classifier (1) is a simple example of an ensemble method, combining the predictions
of an ensemble of classifiers in order to boost the accuracy. We now bound the training error of f .
Introduce
\[ \ell_{\mathrm{ave}} = \frac{1}{T}\sum_{i=1}^T \ell_S(h_i) \]
Without loss of generality, we can assume that ℓ_S(h_i) ≤ 1/2 for each i = 1, . . . , T. This implies ℓ_ave ≤ 1/2 and we can write
\[
\ell_S(f) = \mathbb{P}\big(f(x_Z) \neq y_Z\big)
= \mathbb{P}\Bigg(\sum_{i=1}^T I\{h_i(x_Z) \neq y_Z\} > \frac{T}{2}\Bigg)
= \mathbb{P}\Bigg(\frac{1}{T}\sum_{i=1}^T I\{h_i(x_Z) \neq y_Z\} > \ell_{\mathrm{ave}} + \Big(\frac{1}{2} - \ell_{\mathrm{ave}}\Big)\Bigg)
\]
Now introduce the Bernoulli random variables B_1, . . . , B_T defined by B_i = I{h_i(x_Z) ≠ y_Z}. Note that these random variables are independent, due to our assumption (2). Also, E[B_i] = ℓ_S(h_i) and
\[ \frac{1}{T}\sum_{i=1}^T \mathbb{E}[B_i] = \ell_{\mathrm{ave}} \]
Since B_1, . . . , B_T are independent random variables taking values in [0, 1], Hoeffding's inequality applied to the last probability above gives
\[ \ell_S(f) \le e^{-2T(1/2 - \ell_{\mathrm{ave}})^2} \le e^{-2T\gamma^2} \tag{3} \]
whenever each classifier satisfies γ_i := 1/2 − ℓ_S(h_i) ≥ γ > 0, which implies 1/2 − ℓ_ave ≥ γ.
This result tells us that if we manage to obtain classifiers with independent training errors, in the sense of (2), then the training error of the majority vote classifier decreases exponentially with respect to Tγ², where γ measures how much better than random guessing each classifier h_i performs on the training set. Note that being able to reduce the training error on arbitrary datasets implies a decrease of the bias error.
Bagging. How can we obtain classifiers with independent training errors? A popular heuristic,
known as Bagging, applies to any learning algorithm A for binary classification and to any training
set S. Let m be the size of S. Bagging builds h1 , . . . , hT by drawing m examples uniformly at
random with replacement from S. This process is repeated T times so as to obtain the resampled
training sets S1 , . . . , ST . Then A is run on each Si setting hi = A(Si ). The idea is that the
resampling procedure helps enforce condition (2).
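As an illustration, here is a minimal numpy sketch of Bagging (names are ours, not from the text): A stands for any learning algorithm mapping a training set to a classifier with a predict method.

import numpy as np

def bagging(A, X, y, T, rng=np.random.default_rng(0)):
    """Build h_1, ..., h_T by running A on T bootstrap samples of size m."""
    m = len(y)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)      # m draws with replacement from S
        classifiers.append(A(X[idx], y[idx]))  # h_i = A(S_i)
    return classifiers

def bagged_predict(classifiers, X):
    # Majority vote (1) over the ensemble h_1, ..., h_T
    votes = np.array([h.predict(X) for h in classifiers])
    return np.sign(votes.sum(axis=0))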
One may wonder how different from S any Si can be. To find that out, we take a little detour and
compute the fraction of unique data points in Si . As you will see in a moment, more than one third of
the points of S are missing from Si in expectation! Let N be the number of unique points drawn,
and let Xt be the indicator function of the event that (xt , yt ) is drawn. Then the probability that
(xt , yt ) is not drawn is
\[ \mathbb{P}(X_t = 0) = \Big(1 - \frac{1}{m}\Big)^m \]
So we have
\[ \mathbb{E}[N] = \sum_{t=1}^m \mathbb{E}[X_t] = \sum_{t=1}^m \mathbb{P}(X_t = 1) = \sum_{t=1}^m \Big(1 - \Big(1-\frac{1}{m}\Big)^m\Big) = m - m\Big(1-\frac{1}{m}\Big)^m \]
Therefore, the expected fraction of unique points in S_i is
\[ \frac{\mathbb{E}[N]}{m} = 1 - \Big(1-\frac{1}{m}\Big)^m \approx 1 - \frac{1}{e} = 0.632\ldots \]
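A quick numerical check of the limit above (purely illustrative):

import numpy as np

# The expected fraction of unique points, 1 - (1 - 1/m)^m, is already
# close to 1 - 1/e for moderate values of m.
for m in (10, 100, 1000):
    print(m, 1 - (1 - 1/m)**m)
print("1 - 1/e =", 1 - 1/np.e)   # 0.6321...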
We saw that independence of errors helps reduce the bias by driving the training error to zero.
On the other hand, subsampling of the training set helps reduce the variance. If we think of the
m training points arranged in an m × d matrix (called the data matrix), then what bagging does
is subsampling the rows of this matrix. We now briefly describe another ensemble method that
increases the protection against overfitting by also subsampling the columns of the data matrix.
Random Forest. This ensemble method works by taking a majority vote over an ensemble
h1 , . . . , hT of tree predictors. Similarly to bagging, each tree predictor hi is obtained by running
a learning algorithm over a dataset Si obtained by subsampling the rows of the full data matrix.
However, the algorithm for learning tree predictors does not have direct access to S_i. Indeed, when considering a leaf ℓ for splitting, instead of being given S_{i,ℓ} (the set of training examples in S_i that are routed to ℓ), the algorithm has access to a version of S_{i,ℓ} containing a random subset of the original features (typically, √d features are sampled from the original d features). This additional
sampling provides a better control on the variance at the expense of the bias. Because of its good
performance on many learning tasks, Random Forest is often used as a baseline when testing a new
learning algorithm.
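As a small illustration of the column-subsampling step (the function name is ours, and the split-selection procedure itself is omitted):

import numpy as np

def sample_split_features(d, rng=np.random.default_rng(0)):
    """Random subset of about sqrt(d) feature indices made available for one split."""
    k = max(1, int(np.sqrt(d)))                    # typically sqrt(d) features
    return rng.choice(d, size=k, replace=False)    # column indices to consider

# Example: with d = 25 features, each split only sees 5 random columns
print(sample_split_features(25))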
Boosting. We now introduce boosting, a principled ensemble method that achieves the exponential
bound (3) on the training error without requiring the demanding condition (2). Boosting is an
incremental method to build classifiers of the form sgn(f ), where
\[ f = \sum_{i=1}^T w_i h_i \]
for some coefficients w_1, . . . , w_T ∈ R and base classifiers h_1, . . . , h_T chosen from a fixed class H.
In practice, also in order to save computational costs, the base classifiers are very simple. A typical choice for H is the class of decision stumps. These are all the classifiers of the form h_{i,τ} : R^d → {−1, 1} defined by h_{i,τ}(x) = ± sgn(x_i − τ), where i = 1, . . . , d, τ ∈ R, and x_i denotes the i-th coordinate of x.
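To make the definition concrete, here is a small numpy sketch of decision stumps (function names are ours): evaluating h_{i,τ} and exhaustively searching for the stump with smallest training error, where the search ranges over features, observed thresholds, and the sign.

import numpy as np

def stump_predict(X, i, tau, s):
    """Evaluate h_{i,tau}(x) = s * sgn(x_i - tau), with s in {-1, +1}, on the rows of X."""
    p = np.sign(X[:, i] - tau)
    p[p == 0] = 1          # break ties so that outputs are in {-1, +1}
    return s * p

def best_stump(X, y):
    """Brute-force search for the stump minimizing the (unweighted) training error."""
    m, d = X.shape
    best_err, best = np.inf, None
    for i in range(d):
        for tau in np.unique(X[:, i]):
            for s in (-1, +1):
                err = np.mean(stump_predict(X, i, tau, s) != y)
                if err < best_err:
                    best_err, best = err, (i, tau, s)
    return best, best_err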
The specific boosting algorithm we introduce is known as AdaBoost (adaptive boosting). Fix a
training set S with m examples (x1 , y1 ), . . . , (xm , ym ), and a sequence h1 , . . . , hT of base classifiers.
We now show how to choose the coefficients w_1, . . . , w_T so that the training error is bounded as in (3). AdaBoost uses the convex upper bound I{z ≤ 0} ≤ e^{−z} on the zero-one loss I{f(x_t) y_t ≤ 0}.
This gives
\[ \ell_S(f) = \frac{1}{m}\sum_{t=1}^m I\{f(x_t)y_t \le 0\} \le \frac{1}{m}\sum_{t=1}^m e^{-f(x_t)y_t} = \frac{1}{m}\sum_{t=1}^m e^{-\sum_{i=1}^T w_i h_i(x_t) y_t} \]
Note that other algorithms use different convex upper bounds on the zero-one loss. For example,
SVM uses the hinge loss.
Introduce now the functions L1 , . . . , LT defined by Li (t) = hi (xt )yt . Note that Li (t) ∈ {−1, 1} and
Li (t) = 1 if and only if hi (xt ) = yt . Recalling that Z is a random variable uniformly distributed in
{1, . . . , m}, we can view each Li (Z) as a random variable and write
\[ \ell_S(f) \le \frac{1}{m}\sum_{t=1}^m e^{-\sum_{i=1}^T w_i L_i(t)} = \mathbb{E}\Big[e^{-\sum_{i=1}^T w_i L_i(Z)}\Big] = \mathbb{E}\Big[\prod_{i=1}^T e^{-w_i L_i(Z)}\Big] \]
If condition (2) were true, we could write the expectation of the product as a product of expecta-
tions. In order to sidestep the condition, we change the probability space and write
"T # T
Y Y h i
−wi Li (Z) −wi Li (Zi )
E e = Ei e (4)
i=1 i=1
where each Z_i is a random variable distributed according to a probability P_i on {1, . . . , m} (the distributions P_i will be defined later), E_i denotes expectation with respect to P_i, and we set
\[ \varepsilon_i \stackrel{\mathrm{def}}{=} P_i\big(L_i(Z_i) = -1\big) = \sum_{t=1}^m I\{L_i(t) = -1\}\,P_i(t) \]
Note that ε_i is the error of h_i with respect to the probability P_i. Namely, ε_i is the weighted training error of h_i, where the weights are determined by P_i. Since L_i(Z_i) ∈ {−1, 1}, we have E_i[e^{−w_i L_i(Z_i)}] = e^{−w_i}(1 − ε_i) + e^{w_i}ε_i, and therefore
\[ \ell_S(f) \le \prod_{i=1}^T \Big(e^{-w_i}(1-\varepsilon_i) + e^{w_i}\varepsilon_i\Big) \tag{5} \]
Before computing the P_i, we show how to pick w_1, . . . , w_T in order to minimize (5). By computing the zeros of the derivative of e^{−w}(1 − ε_i) + e^{w}ε_i with respect to w, we find a single zero at
\[ w_i = \frac{1}{2}\ln\frac{1-\varepsilon_i}{\varepsilon_i}\;. \]
Note that the above expression is only defined for 0 < ε_i < 1. As we will see, P_i(t) > 0 for all t ∈ {1, . . . , m}. Hence ε_i ∈ {0, 1} implies that either h_i or −h_i has zero training error on S. If this happens, then we can throw away all h_j for j ≠ i and avoid using boosting altogether. Therefore, without loss of generality we may assume 0 < ε_i < 1 for all i = 1, . . . , T.
Note that w_i = 0 if and only if ε_i = 1/2, meaning that the weight (according to P_i) of the training points where h_i errs is exactly 1/2. Because such an h_i does not affect the value of f (since ε_i = 1/2 implies w_i = 0), without loss of generality we may also assume that ε_i ≠ 1/2.
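As a quick numerical illustration, suppose (hypothetically) that h_i has weighted error ε_i = 0.3 under P_i. Then
\[ w_i = \frac{1}{2}\ln\frac{1 - 0.3}{0.3} = \frac{1}{2}\ln\frac{7}{3} \approx 0.42 \]
so a base classifier that beats random guessing under P_i receives a positive weight, while an ε_i larger than 1/2 would give a negative weight, which amounts to flipping the sign of h_i's vote.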
Set γ_i := 1/2 − ε_i and note that ε_i ≠ 1/2 implies γ_i ≠ 0. Substituting the value of w_i into (5) gives e^{−w_i}(1 − ε_i) + e^{w_i}ε_i = 2√(ε_i(1 − ε_i)) = √(4ε_i(1 − ε_i)). Using the inequality 1 + x ≤ e^x, which holds for all x ∈ R, we get
\[
\ell_S(f) \le \prod_{i=1}^T \sqrt{4\varepsilon_i(1-\varepsilon_i)} = \prod_{i=1}^T \sqrt{1-4\gamma_i^2} \le \prod_{i=1}^T e^{-2\gamma_i^2} = e^{-2\sum_{i=1}^T \gamma_i^2} \le e^{-2T\gamma^2}
\]
where in the last step we assumed γ_i ≥ γ > 0 for all i = 1, . . . , T. This is the same bound as the one we proved in (3) under the condition (2). Note, however, that the definition of γ_i = 1/2 − ε_i changes, because ε_i is now the weighted training error of h_i.
Just like (3), this bound provides a pretty strong control on the bias. Using the observation that
`S (f ) = 0 if and only if `S (f ) < 1/m, we conclude that a number
\[ T > \frac{\ln m}{2\gamma^2} \]
of boosting rounds is sufficient to bring the training error of f down to zero.
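To get a rough sense of scale, suppose (hypothetically) that m = 1000 and that every base classifier has edge γ = 0.1. Then
\[ T > \frac{\ln 1000}{2 \times (0.1)^2} \approx \frac{6.91}{0.02} \approx 345 \]
so T = 346 rounds already suffice to bring the training error of f down to zero under these assumptions.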
It remains to define the probability distributions P_i. AdaBoost sets P_1 to the uniform distribution over {1, . . . , m}, so that P_1(t) = 1/m, and then recursively defines
\[ P_{i+1}(t) = \frac{P_i(t)\,e^{-w_i L_i(t)}}{\mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]} \qquad \text{for } t = 1, \dots, m \text{ and } i = 1, \dots, T \tag{6} \]
where
\[ \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big] = \sum_{s=1}^m e^{-w_i L_i(s)}\, P_i(s) \]
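In code, one application of the update (6) is an element-wise reweighting followed by a normalization, where the normalizer is exactly E_i[e^{-w_i L_i(Z_i)}]. A minimal numpy sketch, with variable names that are ours:

import numpy as np

# One round of the AdaBoost reweighting (6).
# P : current distribution P_i over the m training examples (sums to 1)
# w : the coefficient w_i
# L : array of margins L_i(t) = h_i(x_t) y_t, with entries in {-1, +1}
def adaboost_update(P, w, L):
    unnormalized = P * np.exp(-w * L)          # P_i(t) * exp(-w_i L_i(t))
    return unnormalized / unnormalized.sum()   # divide by E_i[exp(-w_i L_i(Z_i))]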
It is easy to check that P1 , . . . , PT are indeed probability distributions on {1, . . . , m}. In particular,
Pi (t) > 0 and Pi (1) + · · · + Pi (m) = 1.
For this choice of P_i we can prove (4) as follows. First, we solve (6) for e^{−w_i L_i(t)}, obtaining
\[ e^{-w_i L_i(t)} = \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]\,\frac{P_{i+1}(t)}{P_i(t)} \]
Then, we write
\[
\mathbb{E}\Big[\prod_{i=1}^T e^{-w_i L_i(Z)}\Big]
= \frac{1}{m}\sum_{t=1}^m \prod_{i=1}^T e^{-w_i L_i(t)}
= \frac{1}{m}\sum_{t=1}^m \prod_{i=1}^T \Big(\mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]\,\frac{P_{i+1}(t)}{P_i(t)}\Big)
\]
\[
= \frac{1}{m}\sum_{t=1}^m \frac{P_{T+1}(t)}{P_1(t)} \prod_{i=1}^T \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]
= \Bigg(\sum_{t=1}^m P_1(t)\,\frac{P_{T+1}(t)}{P_1(t)}\Bigg)\prod_{i=1}^T \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]
= \prod_{i=1}^T \mathbb{E}_i\big[e^{-w_i L_i(Z_i)}\big]
\]
where we used the telescoping identity \(\prod_{i=1}^T P_{i+1}(t)/P_i(t) = P_{T+1}(t)/P_1(t)\), the fact that P_1(t) = 1/m, and P_{T+1}(1) + · · · + P_{T+1}(m) = 1.
These probability distributions have a simple interpretation when one studies how P_{i+1} depends on P_i. Fix P_i and suppose ε_i < 1/2. Then w_i > 0 and each P_{i+1}(t) is obtained by multiplying P_i(t) by the quantity e^{−w_i L_i(t)} (and then normalizing), where e^{−w_i L_i(t)} > 1 if and only if h_i(x_t) ≠ y_t. In other words, the weight of each training example (x_t, y_t) is increased, when P_i is updated to P_{i+1}, if and only if h_i errs on (x_t, y_t). Intuitively, the boosting process concentrates the weight on the training examples that are misclassified by the previous classifiers. A similar argument applies to the case ε_i > 1/2.
We are now ready to introduce the pseudo-code for AdaBoost. It is convenient to view the boosting process as a sequence of rounds between the boosting algorithm (the booster B) and the learning algorithm A.

[Figure: in each round, the booster B sends the distribution P_i to the learning algorithm A, which returns the base classifier h_i.]

AdaBoost
Input: training set S of m examples, number of rounds T, learning algorithm A.
Initialization: set P_1 to the uniform distribution over {1, . . . , m}.
For i = 1, . . . , T:
1. Give P_i to A and get h_i in response.
2. Compute ε_i for h_i.
3. If ε_i ∈ {0, 1/2, 1} then BREAK.
4. Let w_i ← (1/2) ln((1 − ε_i)/ε_i).
5. Compute P_{i+1} using (6).
If the for loop exited on BREAK, then deal with the special case (described below).
Else output f = sgn(w_1 h_1 + · · · + w_T h_T).
In each round i, the booster B gives P_i to A and gets h_i in response. If A returns an h_i such that ε_i = 1/2, then the boosting process stops and the booster outputs f = sgn(w_1 h_1 + · · · + w_{i−1} h_{i−1}). If A returns an h_i such that ε_i ∈ {0, 1}, then the boosting process also stops, but in this case the booster outputs f = h_i (when ε_i = 0) or f = −h_i (when ε_i = 1).
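To make the whole procedure concrete, here is a minimal, self-contained Python sketch of AdaBoost with decision stumps as base classifiers, following the pseudo-code above. All names are illustrative; the weighted stump search plays the role of the learning algorithm A, and for simplicity the special cases are handled by just stopping.

import numpy as np

def stump_predict(X, i, tau, s):
    p = np.sign(X[:, i] - tau)
    p[p == 0] = 1
    return s * p

def weighted_best_stump(X, y, P):
    """Return the stump (i, tau, s) minimizing the weighted error sum_t P(t) I{h(x_t) != y_t}."""
    m, d = X.shape
    best_err, best = np.inf, None
    for i in range(d):
        for tau in np.unique(X[:, i]):
            for s in (-1, +1):
                err = np.sum(P * (stump_predict(X, i, tau, s) != y))
                if err < best_err:
                    best_err, best = err, (i, tau, s)
    return best, best_err

def adaboost(X, y, T):
    m = len(y)
    P = np.full(m, 1.0 / m)              # P_1 is uniform over {1, ..., m}
    stumps, weights = [], []
    for _ in range(T):
        (i, tau, s), eps = weighted_best_stump(X, y, P)   # get h_i from A
        if eps in (0.0, 0.5, 1.0):                        # special cases: stop boosting
            break                                         # (the text outputs h_i or -h_i when eps_i is 0 or 1)
        w = 0.5 * np.log((1 - eps) / eps)                 # w_i = (1/2) ln((1 - eps_i)/eps_i)
        L = stump_predict(X, i, tau, s) * y               # L_i(t) = h_i(x_t) y_t
        P = P * np.exp(-w * L)                            # update (6) ...
        P = P / P.sum()                                   # ... normalized by E_i[exp(-w_i L_i(Z_i))]
        stumps.append((i, tau, s))
        weights.append(w)
    return stumps, weights

def adaboost_predict(X, stumps, weights):
    # f = sgn(w_1 h_1 + ... + w_T h_T)
    f = sum(w * stump_predict(X, i, tau, s) for (i, tau, s), w in zip(stumps, weights))
    return np.sign(f)

# Tiny usage example on synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels given by a simple linear rule
    stumps, weights = adaboost(X, y, T=50)
    print("training error:", np.mean(adaboost_predict(X, stumps, weights) != y))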