
Generalization Error

MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON

Elie Kawerk
Data Scientist
Supervised Learning - Under the Hood
Supervised Learning: y = f(x), f is unknown.



Goals of Supervised Learning
Find a model f^ that best approximates f: f^ ≈ f.
f^ can be Logistic Regression, Decision Tree, Neural Network, ...
Discard noise as much as possible.

End goal: f^ should achieve a low predictive error on unseen datasets.



Difficulties in Approximating f
Overfitting: f^(x) fits the training set noise.
Underfitting: f^ is not flexible enough to approximate f.



Overfitting



Underfitting



Generalization Error
Generalization Error of f^: Does f^ generalize well on unseen data?
It can be decomposed as follows:
Generalization Error of f^ = bias^2 + variance + irreducible error



Bias
Bias: error term that tells you, on average, how much f^ ≠ f.



Variance
Variance: tells you how much f^ is inconsistent over different training sets.
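
When the true f is known, as in a synthetic problem, both terms can be estimated empirically. The following minimal sketch (not from the course; it assumes a made-up sin target and uses scikit-learn's DecisionTreeRegressor) trains a tree on many independently drawn training sets and measures bias^2 and variance at fixed test points:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def true_f(x):
    return np.sin(x)

# Fixed test points at which bias and variance are measured
x_test = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)

# Collect predictions from trees trained on 200 independent training sets
preds = []
for _ in range(200):
    x_train = rng.uniform(0, 2 * np.pi, 100).reshape(-1, 1)
    y_train = true_f(x_train).ravel() + rng.normal(0, 0.3, 100)  # noisy labels
    tree = DecisionTreeRegressor(max_depth=3).fit(x_train, y_train)
    preds.append(tree.predict(x_test))
preds = np.array(preds)  # shape: (200, 50)

# bias^2: squared gap between the average prediction and the true f
bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test).ravel()) ** 2)
# variance: how much predictions fluctuate across training sets
variance = np.mean(preds.var(axis=0))
print('bias^2: {:.3f}  variance: {:.3f}'.format(bias_sq, variance))

Increasing max_depth in this sketch typically shrinks bias^2 and inflates variance, which previews the tradeoff discussed next.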



Model Complexity
Model Complexity: sets the flexibility of f^.
Example: Maximum tree depth, Minimum samples per leaf, ...
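
As a rough illustration of such a knob (a sketch on synthetic data, not the course's dataset), scikit-learn's validation_curve can sweep max_depth and report training and CV error side by side:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import validation_curve

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)

depths = [1, 2, 4, 8, 16]
train_scores, cv_scores = validation_curve(
    DecisionTreeRegressor(random_state=1), X, y,
    param_name='max_depth', param_range=depths,
    cv=5, scoring='neg_mean_squared_error')

# Negate the scores: sklearn reports negative MSE so that higher is better
for d, tr, cv in zip(depths, -train_scores.mean(axis=1), -cv_scores.mean(axis=1)):
    print('max_depth={:2d}  train MSE={:9.1f}  CV MSE={:9.1f}'.format(d, tr, cv))

Training MSE keeps falling as depth grows, while CV MSE bottoms out and then rises: the signature of moving from underfitting to overfitting.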



Bias-Variance Tradeoff



Bias-Variance Tradeoff: A Visual Explanation



Let's practice!
Diagnosing Bias and Variance Problems

Elie Kawerk
Data Scientist
Estimating the Generalization Error
How do we estimate the generalization error of a model?

It cannot be done directly because:

f is unknown,

usually you only have one dataset,

noise is unpredictable.



Estimating the Generalization Error
Solution:

split the data into training and test sets,

fit f^ to the training set,

evaluate the error of f^ on the unseen test set,

generalization error of f^ ≈ test set error of f^.



Better Model Evaluation with Cross-Validation
Test set should not be touched until we are confident about f^'s performance.
Evaluating f^ on the training set gives a biased estimate: f^ has already seen all training points.
Solution → Cross-Validation (CV):
K-Fold CV (see the sketch after this list),

Hold-Out CV.
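
To make the K-Fold mechanics concrete, here is a minimal sketch (synthetic data, not the course's Auto dataset): each of the K folds takes one turn as the held-out evaluation set while the model is fit on the remaining K-1 folds, and the CV error is the average of the K per-fold errors.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Fit on K-1 folds, evaluate on the held-out fold
    model = DecisionTreeRegressor(max_depth=4, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# CV error = average of the K per-fold errors
print('10-fold CV MSE: {:.2f}'.format(np.mean(fold_errors)))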



K-Fold CV



Diagnose Variance Problems
If f^ suffers from high variance: CV error of f^ > training set error of f^.
f^ is said to overfit the training set. To remedy overfitting:
decrease model complexity,

for example: decrease max depth, increase min samples per leaf, ...

gather more data, ...



Diagnose Bias Problems
If f^ suffers from high bias: CV error of f^ ≈ training set error of f^ >> desired error.
f^ is said to underfit the training set. To remedy underfitting:
increase model complexity,

for example: increase max depth, decrease min samples per leaf, ...

gather more relevant features.



K-Fold CV in sklearn on the Auto Dataset
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

# Set seed for reproducibility
SEED = 123

# X and y are assumed to already hold the Auto dataset's features and labels
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=SEED)

# Instantiate decision tree regressor and assign it to 'dt'
dt = DecisionTreeRegressor(max_depth=4,
                           min_samples_leaf=0.14,
                           random_state=SEED)



K-Fold CV in sklearn on the Auto Dataset
# Evaluate the list of MSEs obtained by 10-fold CV
# Set n_jobs to -1 in order to exploit all CPU cores in computation
MSE_CV = -cross_val_score(dt, X_train, y_train, cv=10,
                          scoring='neg_mean_squared_error',
                          n_jobs=-1)

# Fit 'dt' to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_predict_train = dt.predict(X_train)

# Predict the labels of the test set
y_predict_test = dt.predict(X_test)



# CV MSE
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))

CV MSE: 20.51

# Training set MSE
print('Train MSE: {:.2f}'.format(MSE(y_train, y_predict_train)))

Train MSE: 15.30

# Test set MSE
print('Test MSE: {:.2f}'.format(MSE(y_test, y_predict_test)))

Test MSE: 20.92

Since the CV MSE (20.51) is noticeably higher than the training set MSE (15.30), dt suffers from high variance: it overfits the training set.



Let's practice!
Ensemble Learning

Elie Kawerk
Data Scientist
Advantages of CARTs
Simple to understand.

Simple to interpret.

Easy to use.

Flexibility: ability to describe non-linear dependencies.

Preprocessing: no need to standardize or normalize features, ...



Limitations of CARTs
Classification: can only produce orthogonal decision boundaries.

Sensitive to small variations in the training set.

High variance: unconstrained CARTs may overfit the training set.

Solution: ensemble learning.



Ensemble Learning
Train different models on the same dataset.

Let each model make its predictions.

Meta-model: aggregates predictions of individual models.

Final prediction: more robust and less prone to errors.

Best results: models are skillful in different ways.



Ensemble Learning: A Visual Explanation



Ensemble Learning in Practice: Voting Classifier
Binary classification task.

N classifiers make predictions: P1, P2, ..., PN, with Pi = 0 or 1.

Meta-model prediction: hard voting, i.e. the majority vote among P1, ..., PN (see the sketch below).
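
To illustrate what hard voting computes, here is a tiny sketch with made-up predictions (not real model output): each sample's meta-prediction is the majority class among the individual classifiers' votes.

import numpy as np

# Rows: predictions of 3 classifiers; columns: 5 samples (classes 0 or 1)
P = np.array([[1, 0, 1, 1, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 1, 1]])

# Majority vote: predict 1 when more than half the classifiers vote 1
meta_pred = (P.sum(axis=0) > P.shape[0] / 2).astype(int)
print(meta_pred)  # [1 0 1 1 0]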



Hard Voting



Voting Classifier in sklearn (Breast Cancer dataset)
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# Set seed for reproducibility
SEED = 1



Voting Classifier in sklearn (Breast Cancer dataset)
# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=SEED)

# Instantiate individual classifiers
lr = LogisticRegression(random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)

# Define a list called classifiers that contains the tuples (classifier_name, classifier)
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]



# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

Logistic Regression : 0.947
K Nearest Neighbours : 0.930
Classification Tree : 0.930



Voting Classi er in sklearn (Breast-Cancer
dataset)
# Instantiate a VotingClassifier 'vc'
vc = VotingClassifier(estimators=classifiers)

# Fit 'vc' to the training set and predict test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

# Evaluate the test-set accuracy of 'vc'
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Voting Classifier: 0.953
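
Note that VotingClassifier uses hard voting by default (voting='hard'), which is what this lesson describes. Since all three constituent classifiers here implement predict_proba, switching to soft voting, which averages predicted class probabilities instead of counting class votes, is a one-line change:

# Soft voting averages predicted probabilities rather than counting class votes
vc_soft = VotingClassifier(estimators=classifiers, voting='soft')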



Let's practice!