
Foundations of Machine Learning

Module 7: Computational
Learning Theory
Part A: Finite Hypothesis Space
Sudeshna Sarkar
IIT Kharagpur
Goal of Learning Theory
• To understand
– What kinds of tasks are learnable?
– What kind of data is required for learnability?
– What are the (space, time) requirements of the learning
algorithm?
• To develop and analyze models
– Develop algorithms that provably meet desired criteria
– Prove guarantees for successful algorithms

Goal of Learning Theory
• Two core aspects of ML
– Algorithm Design. How to optimize?
– Confidence for rule effectiveness on future data.
• We need particular settings (models)
– Probably Approximately Correct (PAC)
Pr[ P(c ⊕ h) ≤ 𝜖 ] ≥ 1 − 𝛿
where P(c ⊕ h) is the probability mass, under the data distribution, of the error region c ⊕ h.

[Figure: target concept c and hypothesis h over the instance space; the symmetric difference c ⊕ h is the error region.]
Prototypical Concept Learning Task
• Given
– Instances X (e.g., 𝑋 = ℝ^d or 𝑋 = {0,1}^d)
– Distribution 𝒟 over X
– Target function c
– Hypothesis Space ℋ
– Training Examples S = { (𝑥ᵢ, 𝑐(𝑥ᵢ)) }, 𝑥ᵢ i.i.d. from 𝒟
[Figure: instance space X with positive (+) and negative (−) examples, target concept c and hypothesis h.]
• Determine
– A hypothesis h ∈ ℋ s.t. ℎ(𝑥) = 𝑐(𝑥) for all 𝑥 in S?
– A hypothesis h ∈ ℋ s.t. ℎ(𝑥) = 𝑐(𝑥) for all 𝑥 in X?
• An algorithm optimizes over S to find a hypothesis h.
• Goal: Find h which has small error over 𝒟

Computational Learning Theory
• Can we be certain about how the learning algorithm
generalizes?
• We would have to see all the examples.

• Inductive inference – generalizing beyond the training data is impossible unless we add more assumptions (e.g., priors over H).
[Figure: instance space X with labeled examples, target concept c and hypothesis h.]
We need a bias!
Function Approximation
• How many labeled examples are needed to determine which of the 2^(2^N) hypotheses is the correct one?
• All 2^N instances in X must be labeled!
• Inductive inference: generalizing beyond the training data is impossible unless we add more assumptions (e.g., bias)

𝐻 = { ℎ : 𝑋 → 𝑌 },   |H| = 2^|X| = 2^(2^N)
[Figure: instance space X with labeled examples and two hypotheses h₁, h₂ consistent with the labels shown.]
Error of a hypothesis
The true error of hypothesis h, with respect to the target
concept c and observation distribution 𝒟 is the probability that h
will misclassify an instance drawn according to 𝒟
𝑒𝑟𝑟𝑜𝑟_𝒟(ℎ) = Pr_{𝑥∼𝒟}[ 𝑐(𝑥) ≠ ℎ(𝑥) ]
In a perfect world, we’d like the true error to be 0.

Bias: Fix hypothesis space H


c may not be in H => Find h close to c
A hypothesis h is approximately correct if
𝑒𝑟𝑟𝑜𝑟𝒟 ℎ ≤ 𝜀
PAC model
• Goal: h has small error over D.
• True error: 𝑒𝑟𝑟𝑜𝑟_𝐷(ℎ) = Pr_{𝑥∼𝐷}[ ℎ(𝑥) ≠ 𝑐*(𝑥) ]
• How often ℎ(𝑥) ≠ 𝑐*(𝑥) over future instances drawn at random from D
• But we can only measure:
Training error: 𝑒𝑟𝑟𝑜𝑟_𝑆(ℎ) = (1/𝑚) Σᵢ 𝐼( ℎ(𝑥ᵢ) ≠ 𝑐*(𝑥ᵢ) )
How often ℎ(𝑥) ≠ 𝑐*(𝑥) over the training instances

• Sample Complexity: bound 𝑒𝑟𝑟𝑜𝑟𝐷 (ℎ) in terms of 𝑒𝑟𝑟𝑜𝑟𝑆 (ℎ)
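
To make the two error measures concrete, here is a small illustrative sketch (Python/NumPy, not part of the original slides; the toy threshold concept and sample sizes are hypothetical): the training error is computed on a finite sample S, while the true error is approximated by a large fresh sample drawn from 𝒟.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: x ~ Uniform[0, 1], target concept c*(x) = [x >= 0.50],
# hypothesis h(x) = [x >= 0.45] (a slightly wrong threshold).
c_star = lambda x: (x >= 0.50).astype(int)
h = lambda x: (x >= 0.45).astype(int)

# Training error: average disagreement on an i.i.d. sample S of size m.
m = 50
S = rng.uniform(0, 1, size=m)
error_S = np.mean(h(S) != c_star(S))

# True error: Pr_{x~D}[h(x) != c*(x)], approximated with a large fresh sample
# (here it should be close to 0.05, the mass of the interval [0.45, 0.50)).
big = rng.uniform(0, 1, size=1_000_000)
error_D = np.mean(h(big) != c_star(big))

print(f"training error (m={m}): {error_S:.3f}   true error: {error_D:.3f}")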


Probably Approximately Correct Learning

• PAC Learning concerns efficient learning


• We would like to prove that
– With high probability an (efficient) learning algorithm will
find a hypothesis that is approximately identical to the
hidden target concept.

• We specify two parameters, 𝜀 and 𝛿, and require that with probability at least (1 − 𝛿) the system learns a concept with error at most 𝜀.
Sample Complexity for Supervised Learning
Theorem
𝑚 ≥ (1/𝜖) ( ln|𝐻| + ln(1/𝛿) )
labeled examples are sufficient so that with prob. 1 − 𝛿, all ℎ ∈ 𝐻 with 𝑒𝑟𝑟𝑜𝑟_𝐷(ℎ) ≥ 𝜖 have 𝑒𝑟𝑟𝑜𝑟_𝑆(ℎ) > 0.
• inversely linear in 𝜖
• logarithmic in |H|
• 𝜖 error parameter: D might place low weight on certain parts of the
space
• 𝛿 confidence parameter: there is a small chance the examples we
get are not representative of the distribution
Sample Complexity for Supervised Learning
Theorem: 𝑚 ≥ (1/𝜖) ( ln|𝐻| + ln(1/𝛿) ) labeled examples are sufficient so that
with prob. 1 − 𝛿, all ℎ ∈ 𝐻 with 𝑒𝑟𝑟𝑜𝑟_𝐷(ℎ) ≥ 𝜖 have 𝑒𝑟𝑟𝑜𝑟_𝑆(ℎ) > 0.
Proof: Assume there are k bad hypotheses H_bad = {ℎ₁, ℎ₂, …, ℎ_𝑘} with 𝑒𝑟𝑟_𝐷(ℎᵢ) ≥ 𝜖.
• Fix ℎᵢ. Prob. that ℎᵢ is consistent with the first training example is ≤ 1 − 𝜖. Prob. that ℎᵢ is consistent with the first m training examples is ≤ (1 − 𝜖)^𝑚.
• Prob. that at least one ℎᵢ is consistent with the first m training examples is ≤ 𝑘(1 − 𝜖)^𝑚 ≤ |𝐻|(1 − 𝜖)^𝑚.
• Calculate the value of m so that |𝐻|(1 − 𝜖)^𝑚 ≤ 𝛿.
• Using the fact that 1 − 𝑥 ≤ 𝑒^(−𝑥), it is sufficient to set |𝐻| 𝑒^(−𝜖𝑚) ≤ 𝛿.
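
As a quick numerical check of this bound (an illustrative sketch, not from the slides; the function name pac_sample_size is my own), the realizable-case sample size can be computed directly:

import math

def pac_sample_size(H_size, eps, delta):
    # Realizable case: m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 1000, eps = 0.1, delta = 0.05 -> 100 examples
print(pac_sample_size(1000, 0.1, 0.05))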
Sample Complexity: Finite Hypothesis Spaces – Realizable Case
PAC: How many examples suffice to guarantee small error whp.
Theorem
𝑚 ≥ (1/𝜖) ( ln|𝐻| + ln(1/𝛿) )
labeled examples are sufficient so that with prob. 1 − 𝛿, all ℎ ∈ 𝐻 with 𝑒𝑟𝑟_𝐷(ℎ) ≥ 𝜖 have 𝑒𝑟𝑟_𝑆(ℎ) > 0.

Statistical Learning Way:


With probability at least 1 − 𝛿, for all ℎ ∈ 𝐻 s.t. 𝑒𝑟𝑟_𝑆(ℎ) = 0 we have
𝑒𝑟𝑟_𝐷(ℎ) ≤ (1/𝑚) ( ln|𝐻| + ln(1/𝛿) )

Derivation of the bound on m:
P(consist(H_bad, D)) ≤ |𝐻| 𝑒^(−𝜖𝑚) ≤ 𝛿
𝑒^(−𝜖𝑚) ≤ 𝛿 / |𝐻|
−𝜖𝑚 ≤ ln( 𝛿 / |𝐻| )
𝑚 ≥ −ln( 𝛿 / |𝐻| ) / 𝜖     (flip inequality)
𝑚 ≥ ln( |𝐻| / 𝛿 ) / 𝜖
𝑚 ≥ (1/𝜖) ( ln(1/𝛿) + ln|𝐻| )
Sample complexity: inconsistent finite |ℋ|
• For a single hypothesis to have a misleading training error (Hoeffding bound):
Pr[ 𝑒𝑟𝑟𝑜𝑟_𝒟(𝑓) ≥ 𝜀 + 𝑒𝑟𝑟𝑜𝑟_𝑆(𝑓) ] ≤ 𝑒^(−2𝑚𝜀²)
• We want to ensure that the best hypothesis has error bounded in this way
– So consider that any one of them could have a large error (union bound):
Pr[ (∃𝑓 ∈ ℋ) 𝑒𝑟𝑟𝑜𝑟_𝒟(𝑓) ≥ 𝜀 + 𝑒𝑟𝑟𝑜𝑟_𝑆(𝑓) ] ≤ |ℋ| 𝑒^(−2𝑚𝜀²)
• From this we can derive the bound for the number of samples needed:
𝑚 ≥ (1 / (2𝜀²)) ( ln|ℋ| + ln(1/𝛿) )
Sample Complexity: Finite Hypothesis Spaces

Consistent Case
Theorem
𝑚 ≥ (1/𝜖) ( ln|𝐻| + ln(1/𝛿) )
labeled examples are sufficient so that with prob. 1 − 𝛿, all ℎ ∈ 𝐻 with 𝑒𝑟𝑟_𝐷(ℎ) ≥ 𝜖 have 𝑒𝑟𝑟_𝑆(ℎ) > 0.

Inconsistent Case
What if there is no perfect h?
Theorem: After m examples, with probability ≥ 1 − 𝛿, all ℎ ∈ 𝐻 have
|𝑒𝑟𝑟_𝐷(ℎ) − 𝑒𝑟𝑟_𝑆(ℎ)| < 𝜖, for
𝑚 ≥ (1 / (2𝜖²)) ( ln|𝐻| + ln(2/𝛿) )
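
A companion sketch for the inconsistent (agnostic) case, following the two-sided bound above (illustrative; agnostic_sample_size is a hypothetical helper):

import math

def agnostic_sample_size(H_size, eps, delta):
    # Inconsistent case: m >= (1/(2*eps^2)) * (ln|H| + ln(2/delta))
    return math.ceil((math.log(H_size) + math.log(2.0 / delta)) / (2 * eps ** 2))

# The 1/eps^2 dependence makes this much larger than the realizable-case bound:
print(agnostic_sample_size(1000, 0.1, 0.05))   # 530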
Sample complexity: example
• 𝒞 : Conjunction of n Boolean literals. Is 𝒞 PAC-learnable?
|ℋ| = 3ⁿ
𝑚 ≥ (1/𝜀) ( 𝑛 ln 3 + ln(1/𝛿) )

• Concrete examples:
– δ=ε=0.05, n=10 gives 280 examples
– δ=0.01, ε=0.05, n=10 gives 312 examples
– δ=ε=0.01, n=10 gives 1,560 examples
– δ=ε=0.01, n=50 gives 5,954 examples
• The result holds for any consistent learner, such as Find-S.
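
The concrete numbers above can be reproduced with a short script (illustrative; conjunction_sample_size is a hypothetical helper name):

import math

def conjunction_sample_size(n, eps, delta):
    # m >= (1/eps) * (n*ln(3) + ln(1/delta)) for conjunctions of n Boolean literals
    return math.ceil((n * math.log(3) + math.log(1.0 / delta)) / eps)

print(conjunction_sample_size(10, 0.05, 0.05))   # 280
print(conjunction_sample_size(10, 0.05, 0.01))   # 312
print(conjunction_sample_size(10, 0.01, 0.01))   # 1560
print(conjunction_sample_size(50, 0.01, 0.01))   # 5954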
Sample Complexity of Learning
Arbitrary Boolean Functions
• Consider any boolean function over n boolean features
such as the hypothesis space of DNF or decision trees.
There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is:
𝑚 ≥ (1/𝜀) ( ln 2^(2ⁿ) + ln(1/𝛿) ) = (1/𝜀) ( 2ⁿ ln 2 + ln(1/𝛿) )

• δ=ε=0.05, n=10 gives 14,256 examples


• δ=ε=0.05, n=20 gives 14,536,410 examples
• δ=ε=0.05, n=50 gives 1.561×10¹⁶ examples

Thank You
Concept Learning Task
“Days in which Aldo enjoys swimming”
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

• Hypothesis Representation: conjunction of constraints on the 6 instance attributes
– “?” : any value is acceptable
– a single required value for the attribute
– “∅” : no value is acceptable
Concept Learning

h = (?, Cold, High, ?, ?, ?)


indicates that Aldo enjoys his favorite sport on
cold days with high humidity

Most general hypothesis: (?, ?, ?, ?, ?, ? )


Most specific hypothesis: (∅, ∅, ∅, ∅, ∅, ∅)
Find-S Algorithm
1. Initialize h to the most specific hypothesis in ℋ
2. For each positive training instance x
For each attribute constraint ai in h
IF the constraint ai in h is satisfied by x
THEN do nothing
ELSE replace ai in h by next more general
constraint satisfied by x
3. Output hypothesis h
Concept Learning
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

Finding a Maximally Specific Hypothesis


Find-S Algorithm
h1 ← (∅, ∅, ∅, ∅, ∅, ∅)
h2 ← (Sunny, Warm, Normal, Strong, Warm, Same)
h3 ← (Sunny, Warm, ?, Strong, Warm, Same)
h4 ← (Sunny, Warm, ?, Strong, ?, ?)
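
A compact sketch of Find-S in Python (illustrative, not from the slides; the find_s function name is my own), reproducing the trace above on the EnjoySport data:

def find_s(examples):
    # Find-S: maximally specific conjunctive hypothesis consistent with the
    # positive examples. '∅' = no value acceptable, '?' = any value acceptable.
    n = len(examples[0][0])
    h = ['∅'] * n                      # start with the most specific hypothesis
    for x, label in examples:
        if label != 'Yes':             # Find-S ignores negative examples
            continue
        for i in range(n):
            if h[i] == '∅':
                h[i] = x[i]            # adopt the first positive example's value
            elif h[i] != x[i]:
                h[i] = '?'             # generalize to "any value"
    return h

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']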

Thank You
Foundations of Machine Learning
Module 7: Computational
Learning Theory
Part A
Sudeshna Sarkar
IIT Kharagpur
Sample Complexity: Infinite
Hypothesis Spaces
• Need some measure of the expressiveness of infinite
hypothesis spaces.
• The Vapnik-Chervonenkis (VC) dimension provides
such a measure, denoted VC(H).
• Analogous to ln|H|, there are bounds for sample complexity using VC(H).
Shattering
• Consider a hypothesis space for the 2-class problem.
• A set of 𝑁 points (instances) can be labeled as + or − in 2^N ways.
• If for every such labeling a function can be found in ℋ consistent with that labeling, we say that the set of instances is shattered by ℋ.
Three points in R2
• It is enough to find one set of three points that can be
shattered.
• It is not necessary to be able to shatter every possible set of
three points in 2 dimensions
Shattering Instances
• Consider 2 instances described using a single real-
valued feature being shattered by a single
interval.

[Figure: two points x, y on the real line; every labeling of {x, y} can be produced by a single interval.]
Shattering Instances (cont)
But 3 instances cannot be shattered by a single interval.
[Figure: three points x, y, z on the real line.]

Labelings of {x, y, z} by a single interval:
  +          |  −
  (none)     |  x, y, z
  x          |  y, z
  y          |  x, z
  x, y       |  z
  x, y, z    |  (none)
  y, z       |  x
  z          |  x, y
Cannot do:  + = {x, z},  − = {y}

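This can be verified by brute force (an illustrative sketch, not from the slides): enumerate every labeling of a point set and test whether some closed interval [a, b] realizes it.

from itertools import product

def interval_shatters(points):
    # True if the class {[a, b] : a <= b} (x labeled + iff a <= x <= b)
    # shatters the given set of real-valued points.
    candidates = sorted(points)        # endpoints among the points themselves suffice
    for labeling in product([0, 1], repeat=len(points)):
        if not any(labeling):          # all-negative labeling: use an empty interval
            continue
        realizable = any(
            a <= b and all((a <= x <= b) == bool(y) for x, y in zip(points, labeling))
            for a in candidates for b in candidates
        )
        if not realizable:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True  -> VC(single intervals) >= 2
print(interval_shatters([1.0, 2.0, 3.0]))   # False -> 3 points cannot be shattered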
VC Dimension
• The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered, then VC(H) = ∞.

• If there exists at least one subset of X of size d that can be


shattered then VC(H) ≥ d.
• If no subset of size d can be shattered, then VC(H) < d.

• For single intervals on the real line, all sets of 2 instances can be shattered, but no set of 3 instances can, so VC(H) = 2.
VC Dimension
• An unbiased hypothesis space shatters the entire instance
space.
• The larger the subset of X that can be shattered, the more
expressive (and less biased) the hypothesis space is.
• The VC dimension of the set of oriented lines in 2-d is
three.

• Since there are 2^m partitions of m instances, in order for H to shatter m instances: |H| ≥ 2^m.
• Since |H| ≥ 2^m is required to shatter m instances, VC(H) ≤ log₂|H|.

VC Dimension Example
Consider axis-parallel rectangles in the real-plane,
i.e. conjunctions of intervals on two real-valued
features. Some 4 instances can be shattered.

Some 4 instances cannot be shattered:


VC Dimension Example (cont)
• No five instances can be shattered, since there can be at most 4 distinct extreme points (min and max on each of the 2 dimensions), and these 4 cannot be included in a rectangle without also including the 5th point.

• Therefore VC(H) = 4
• Generalizes to axis-parallel hyper-rectangles (conjunctions of
intervals in n dimensions): VC(H)=2n.
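
A brute-force check of the four-point case (illustrative sketch; the diamond and "inner point" configurations are my own examples): four points in a diamond arrangement can be shattered by axis-aligned rectangles, while four points where one lies inside the bounding box of the others cannot.

from itertools import product

def rect_shatters(points):
    # True if axis-aligned rectangles shatter the given 2-D points.
    # It suffices to test the tightest rectangle around each positive subset.
    for labeling in product([0, 1], repeat=len(points)):
        pos = [p for p, lab in zip(points, labeling) if lab]
        if not pos:
            continue                   # all-negative: an empty rectangle works
        x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
        y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
        covered = {p for p in points if x0 <= p[0] <= x1 and y0 <= p[1] <= y1}
        if covered != set(pos):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]    # 4 points, shatterable
with_inner = [(0, 0), (2, 0), (0, 2), (1, 1)]   # (1,1) lies inside the others' box
print(rect_shatters(diamond), rect_shatters(with_inner))   # True False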

Upper Bound on Sample Complexity with VC

• Using VC dimension as a measure of expressiveness, the following number of examples has been shown to be sufficient for PAC Learning (Blumer et al., 1989):
𝑚 ≥ (1/𝜀) ( 4 log₂(2/𝛿) + 8 VC(𝐻) log₂(13/𝜀) )

• Compared to the previous result using ln|H|, this bound has


some extra constants and an extra log2(1/ε) factor. Since
VC(H) ≤ log2|H|, this can provide a tighter upper bound on
the number of examples needed for PAC learning.
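
A direct transcription of this bound into code (illustrative sketch; vc_upper_bound is a hypothetical helper):

import math

def vc_upper_bound(vc_dim, eps, delta):
    # Blumer et al. (1989): m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# e.g. axis-parallel rectangles in the plane have VC(H) = 4:
print(vc_upper_bound(4, 0.1, 0.05))   # roughly 2,460 examples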

Sample Complexity Lower Bound with VC
• There is also a general lower bound on the minimum number of
examples necessary for PAC learning (Ehrenfeucht, et al., 1989):
Consider any concept class C such that VC(C) ≥ 2, any learner 𝐿, and any 0 < 𝜀 < 1/8, 0 < 𝛿 < 1/100.
Then there exists a distribution D and a target concept in C such that if L observes fewer than
max[ (1/𝜀) log₂(1/𝛿) ,  (VC(C) − 1) / (32𝜀) ]
examples, then with probability at least δ, L outputs a hypothesis
having error greater than ε.
• Ignoring constant factors, this lower bound is the same as the upper
bound except for the extra log2(1/ ε) factor in the upper bound.
Thank You
Foundations of Machine Learning

Module 8: Ensemble Learning


Part A

Sudeshna Sarkar
IIT Kharagpur
What is Ensemble Classification?
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
Why should it work?
• Works well only if the individual classifiers
disagree
– Error rate < 0.5 and errors are independent
– The ensemble error rate is highly correlated with the correlation among the errors made by the different learners
Bias vs. Variance
• We would like low bias error and low variance error
• Ensembles using multiple trained (high variance/low
bias) models can average out the variance, leaving
just the bias
– Less worry about overfit (stopping criteria, etc.)
with the base models
Combining Weak Learners
• Combining weak learners
– Assume n independent models, each having accuracy of
70%.
– If all n give the same class output, then you can be confident it is correct with probability 1 − (1 − 0.7)ⁿ.
– Normally not completely independent, but unlikely that all n
would give the same output
• Using the majority output, the ensemble accuracy is better than the base accuracy of the individual models.
– If n1 models say class 1 and n2<n1 models say class 2, then
P(class1) = 1 – Binomial(n, n2, .7)

P(r) = [ n! / (r! (n − r)!) ] p^r (1 − p)^(n−r)
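
A short sketch (illustrative, using the binomial formula above) of how majority voting over n independent classifiers, each 70% accurate, improves the ensemble accuracy:

from math import comb

def majority_vote_accuracy(n, p=0.7):
    # P(majority of n independent classifiers, each correct with prob. p, is correct);
    # assumes n is odd so there are no ties.
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n // 2 + 1, n + 1))

for n in (1, 5, 11):
    print(n, round(majority_vote_accuracy(n), 3))   # 0.7, ~0.837, ~0.922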
Ensemble Creation Approaches
• Get less correlated errors between models
– Injecting randomness
• initial weights (eg, NN), different learning parameters,
different splits (eg, DT) etc.
– Different Training sets
• Bagging, Boosting, different features, etc.
– Forcing differences
• different objective functions
– Different machine learning model
Ensemble Combining Approaches
• Unweighted Voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting),
Expertise, etc.
• Stacking - Learn the combination function
Combine Learners: Voting
• Unweighted voting
• Linear combination
(weighted vote)
• weight ∝ accuracy
• weight ∝ 1Τvariance
y = Σⱼ₌₁ᴸ wⱼ dⱼ ,   where wⱼ ≥ 0 and Σⱼ₌₁ᴸ wⱼ = 1

• Bayesian combination:
P(Cᵢ | x) = Σ_{all models Mⱼ} P(Cᵢ | x, Mⱼ) P(Mⱼ)
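
A minimal sketch (illustrative; the probability outputs and validation accuracies below are made up) of unweighted vs. accuracy-weighted voting over base-learner class-probability outputs dⱼ:

import numpy as np

# Hypothetical outputs d_j of L = 3 base learners for one instance
# (rows = learners, columns = class probabilities), plus validation accuracies.
d = np.array([[0.6, 0.4],
              [0.7, 0.3],
              [0.1, 0.9]])
acc = np.array([0.80, 0.75, 0.55])

unweighted = d.mean(axis=0)       # simple average of the learners' outputs
w = acc / acc.sum()               # weights proportional to accuracy, summing to 1
weighted = w @ d                  # linear combination y = sum_j w_j d_j

print(unweighted.argmax(), weighted.argmax())   # 1 0: weighting flips the decision here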
Fixed Combination Rules
Bayes Optimal Classifier
• The Bayes Optimal Classifier is an ensemble of all the
hypotheses in the hypothesis space.
• On average, no other ensemble can outperform it.
• The vote for each hypothesis
– is proportional to the likelihood that the training dataset would be sampled from a system in which that hypothesis were true,
– and is multiplied by the prior probability of that hypothesis.
y = argmax_{cⱼ ∈ C} Σ_{hᵢ ∈ H} P(cⱼ | hᵢ) P(T | hᵢ) P(hᵢ)
where
• y is the predicted class,
• C is the set of all possible classes,
• H is the hypothesis space,
• T is the training data.
The Bayes Optimal Classifier represents a hypothesis
that is not necessarily in H.
But it is the optimal hypothesis in the ensemble space.
Practicality of Bayes Optimal Classifier
• Cannot be practically implemented.
• Most hypothesis spaces are too large
• Many hypotheses output a class label or a value, not a probability
• Estimating the prior probability for each hypothesis is not always possible
BMA (Bayesian Model Averaging)

• All possible models in the model space are used, weighted by their probability of being the “correct” model
• Optimal given the correct model space and priors
Why are Ensembles Successful?
• Bayesian perspective:
P(Cᵢ | x) = Σ_{all models Mⱼ} P(Cᵢ | x, Mⱼ) P(Mⱼ)

• If the dⱼ are independent:
Var(y) = Var( (1/L) Σⱼ dⱼ ) = (1/L²) Var( Σⱼ dⱼ ) = (1/L²) · L · Var(dⱼ) = (1/L) Var(dⱼ)

Bias does not change; variance decreases by a factor of L.

• If the dⱼ are dependent, error increases with positive correlation:
Var(y) = (1/L²) Var( Σⱼ dⱼ ) = (1/L²) [ Σⱼ Var(dⱼ) + 2 Σⱼ Σ_{i<j} Cov(dᵢ, dⱼ) ]
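
A quick numerical illustration of this variance argument (a sketch with simulated predictions, not from the slides): averaging L independent predictors divides the variance by L, while positively correlated predictors benefit much less.

import numpy as np

rng = np.random.default_rng(1)
L, n_trials = 10, 100_000

# Independent base predictions d_j with unit variance: Var of the average ~ 1/L.
d_indep = rng.normal(0, 1, size=(n_trials, L))
print("independent:", d_indep.mean(axis=1).var())   # ~ 0.10

# Correlated predictions: a shared component gives Cov(d_i, d_j) = 0.5, Var(d_j) = 1.
shared = rng.normal(0, np.sqrt(0.5), size=(n_trials, 1))
d_corr = shared + rng.normal(0, np.sqrt(0.5), size=(n_trials, L))
print("correlated :", d_corr.mean(axis=1).var())    # ~ 0.55, far above 1/L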
Challenge for developing Ensemble Models

• The main challenge is to obtain base models which are


independent and make independent kinds of errors.
• Independence between two base classifiers can be assessed by measuring the degree of overlap in the training examples they misclassify: |A ∩ B| / |A ∪ B|
Thank You
Foundations of Machine Learning

Module 8: Ensemble Learning


Part B: Bagging and Boosting

Sudeshna Sarkar
IIT Kharagpur
Bagging
• Bagging = “bootstrap aggregation”
– Draw N items from X with replacement
• Desired: base learners with high variance (unstable)
– Decision trees and ANNs are unstable
– K-NN is stable
• Use bootstrapping to generate L training sets and
train one base-learner with each (Breiman, 1996)
• Use voting
Bagging
• Sampling with replacement

• Build classifier on each bootstrap sample

• Each example has probability 1 − (1 − 1/n)ⁿ ≈ 0.632 of being selected for a given bootstrap sample (and probability (1 − 1/n)ⁿ ≈ 0.368 of being left out)
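
A minimal bagging sketch (illustrative; it uses scikit-learn's DecisionTreeClassifier as the unstable base learner and assumes integer class labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, L=25, seed=0):
    # Train L trees, each on a bootstrap sample (n items drawn with replacement).
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Unweighted voting: majority class over the base learners' predictions.
    votes = np.stack([m.predict(X) for m in models])          # shape (L, n_samples)
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])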
Boosting
• An iterative procedure. Adaptively change distribution of
training data.
– Initially, all N records are assigned equal weights
– Weights change at the end of boosting round
• On each iteration t:
– Weight each training example by how incorrectly it was
classified
– Learn a hypothesis: ℎ𝑡
– A strength for this hypothesis: 𝛼𝑡
• Final classifier:
– A linear combination of the votes of the different
classifiers weighted by their strength
• “weak” learners
– P(correct) > 50%, but not necessarily much better
Adaboost
• Boosting can turn a weak algorithm into a strong
learner.
• Input: S={ 𝑥1 , 𝑦1 , … , (𝑥𝑚 , 𝑦𝑚 ) }
• 𝐷𝑡 (𝑖) : weight of i th training example
• Weak learner A
• For 𝑡 = 1,2, … , 𝑇
– Construct 𝐷𝑡 on {𝑥1 , 𝑥2 …}
– Run A on 𝐷𝑡 producing ℎ𝑡 : 𝑋 → {−1,1}
𝜖𝑡 =error of ℎ𝑡 over 𝐷𝑡
Given: 𝑥1 , 𝑦1 , … , (𝑥𝑚 , 𝑦𝑚 ) where 𝑥𝑖 ∈ 𝑋, 𝑦𝑖 ∈ 𝑌 = −1, +1
Initialize 𝐷1 𝑖 = 1Τ𝑚.
For 𝑡 = 1, … , 𝑇:
– Train weak learner using distribution 𝐷𝑡 .
– Get weak classifier ℎ𝑡 : 𝑋 → ℝ.
– Choose 𝛼𝑡 ∈ ℝ.
– Update:
D_{t+1}(i) = D_t(i) exp(−α_t yᵢ h_t(xᵢ)) / Z_t
Where 𝑍𝑡 is a normalization factor
Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t yᵢ h_t(xᵢ))
Output the final classifier:
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
Given: (x₁, y₁), …, (x_m, y_m) where xᵢ ∈ X, yᵢ ∈ Y = {−1, +1}
Initialize D₁(i) = 1/m.
For t = 1, …, T:
– Train the weak learner using distribution D_t.
– Get a weak classifier h_t : X → ℝ.
– Choose α_t ∈ ℝ to minimize training error:
α_t = (1/2) ln( (1 − ε_t) / ε_t ),   where   ε_t = Σ_{i=1}^{m} D_t(i) δ( h_t(xᵢ) ≠ yᵢ )
– Update:
D_{t+1}(i) = D_t(i) exp(−α_t yᵢ h_t(xᵢ)) / Z_t
where Z_t is a normalization factor:
Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t yᵢ h_t(xᵢ))
Output the final classifier:
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
Strong weak classifiers
• If each classifier is (at least slightly) better than random: ε_t < 0.5
• It can be shown that AdaBoost will achieve zero training error (exponentially fast):

(1/m) Σ_{i=1}^{m} δ( H(xᵢ) ≠ yᵢ ) ≤ Π_t Z_t ≤ exp( −2 Σ_{t=1}^{T} (1/2 − ε_t)² )
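
The update rules above can be put together in a compact AdaBoost sketch with decision stumps as the weak learners (illustrative Python/NumPy, not the original implementation from the slides; labels must be in {−1, +1}):

import numpy as np

def adaboost(X, y, T=20):
    # Minimal AdaBoost; X is (n, d), y in {-1, +1}.
    n, d = X.shape
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/m
    ensemble = []                              # (alpha_t, feature, threshold, sign)
    for _ in range(T):
        # Weak learner: exhaustively pick the stump with the lowest weighted error.
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] >= thr, 1, -1)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        eps, j, thr, s = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        pred = s * np.where(X[:, j] >= thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)      # up-weight misclassified examples
        D /= D.sum()                           # normalize by Z_t
        ensemble.append((alpha, j, thr, s))
    return ensemble

def adaboost_predict(ensemble, X):
    # H(x) = sign(sum_t alpha_t h_t(x))
    score = sum(a * s * np.where(X[:, j] >= thr, 1, -1) for a, j, thr, s in ensemble)
    return np.sign(score)

On a small toy dataset this typically drives the training error to zero within a few rounds, consistent with the exponential bound above.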
Illustrating AdaBoost
[Figure: a one-dimensional toy dataset of + and − points, all starting with equal weight 0.1. Each boosting round fits a stump B_t; misclassified points receive larger weights in the next round.]
• Round 1 (B1): α = 1.9459
• Round 2 (B2): α = 2.9323
• Round 3 (B3): α = 3.8744
• Overall: the weighted combination of the three rounds classifies all the training points correctly.
Thank You
