
Machine Learning Summary 2017

Pieter Schaap (2014), updated by Andrew Gold (2017)


March 13, 2018

Contents

1 Lecture 1: Version Spaces
1.1 Classification Task
1.2 Learning Classifiers
1.3 Conjunction of Discrete Attributes
1.4 FindS Algorithm
1.5 Version Spaces
1.6 List elimination Algorithm
1.7 Boundary Sets
1.8 Candidate Elimination Algorithm
1.8.1 Picking training instances
1.8.2 Unanimous-Voting rule
1.8.3 Inductive Bias
1.8.4 Unanimous Voting
1.8.5 Accuracy
1.9 Volume Extension Approach
1.9.1 In Practice
1.10 K-Version Spaces

2 Lecture 2: Decision Trees
2.1 Decision Trees for Classification
2.1.1 Now when can we use decision trees?
2.1.2 Decision Tree Learning
2.1.3 Entropy
2.1.4 Information Gain
2.2 ID3 Algorithm
2.2.1 Hypothesis Space
2.2.2 Inductive Bias in ID3
2.3 Overfitting, Underfitting, and Pruning
2.3.1 Causes of Overfitting
2.3.2 Avoiding Overfitting
2.3.3 Underfitting
2.3.4 Identifying Overfitness, Underfitness, and Optimality
2.3.5 Growing Set vs Validation Set
2.3.6 Reduced-Error Pruning
2.3.7 Rule Post-Pruning
2.3.8 Impurity
2.3.9 Reduction of impurity
2.3.10 Gini Index
2.4 Dealing with continuous attributes
2.5 Oblique Decision Trees
2.6 Attributes with Many Values
2.6.1 Gain Ratio
2.7 Missing Attribute Values
2.8 Windowing

3 Lecture 3: Evaluation of Learning Models
3.1 Motivation
3.2 Evaluation of Classifier Performance
3.2.1 Confusion Matrix
3.2.2 Metrics
3.2.3 Confidence Intervals for Estimates on Classification Performance
3.2.4 Metric Evaluation TL;DR
3.3 Comparing Data-Mining Classifiers
3.3.1 Counting the Costs
3.3.2 Cost-Sensitive Classification
3.4 Lift Charts
3.4.1 Generating a Lift Chart
3.5 ROC Curves
3.5.1 ROC Convex Hull
3.5.2 Iso-Accuracy Lines
3.5.3 Constructing ROC Curve for 1 Classifier
3.5.4 Area Under Curve Metric (AUC)

4 Lecture 4: Bayesian Learning
4.1 Introduction
4.2 Bayes Theorem
4.3 Maximum a Posteriori Hypothesis (MAP)
4.4 Useful Formulas
4.5 Brute Force MAP hypothesis learner
4.6 Minimum Description Length Principle
4.7 Bayes Optimal Classifier
4.8 Gibbs Classifier
4.9 Naïve Bayes Classifier

5 Lecture 5: Linear Regression
5.1 Supervised Learning: Regression
5.1.1 Regression versus Classification
5.2 Linear Regression
5.3 Cost function intuition
5.3.1 Least Squares Error
5.4 Gradient descent
5.4.1 Choosing Learning Rate
5.4.2 Multiple Features
5.5 Normal Equation
5.5.1 Feature Scaling
5.5.2 The Algorithm
5.6 Normal Equation vs Gradient Descent
5.7 Finding the "right" model
5.7.1 Regularization

6 Lecture 6: Logistic Regression and Artificial Neural Networks
6.1 Logistic Regression
6.1.1 Sigmoid Logistic Regression
6.1.2 Non-Linear Decision Boundaries
6.2 Cost Function
6.3 Gradient Descent for Logistic Regression
6.4 Multi-Class Problems
6.5 Artificial Neural Networks
6.5.1 Forward Propagation
6.5.2 Learning The Weights
6.5.3 Properties Of Neural Networks

7 Lecture 7: Recommender Systems
7.1 Collaborative Filtering
7.2 Content Based Approach
7.3 Collaborative Filtering
7.3.1 Collaborative Filtering Algorithm
7.3.2 Mean Normalization
7.4 Support Vector Machines
7.4.1 Linear SVMs
7.4.2 Non-Linear SVMs
7.4.3 Logistic Regression to SVM
7.4.4 Kernels
7.4.5 Cost Function
7.5 Compare SVM

8 Lecture 8:
8.1 Nearest Neighbor Algorithm
8.1.1 Properties
8.1.2 Decision Boundaries
8.1.3 Lazy vs Eager Learning
8.1.4 Inductive vs Transductive learning
8.1.5 Semi-Supervised Learning
8.1.6 Distance Definition
8.1.7 Normalization of Attributes
8.1.8 Weighted Distances
8.1.9 More distances
8.2 Distance-weighted kNN
8.2.1 Edited k-nearest neighbor
8.3 Pipeline Filters
8.4 kD-trees
8.5 Local Learning
8.6 Comments on k-NN
8.7 Decision Boundaries
8.8 Sequential Covering Approaches
8.8.1 Candidate Literals
8.8.2 Sequential covering
8.8.3 Heuristics
8.9 Example-driven Top-down Rule induction
8.10 Avoiding over-fitting

9 Lecture 9: Clustering
9.1 Unsupervised Learning
9.2 Clustering
9.3 Similarity Measures
9.4 Flat vs. Hierarchical Clustering
9.5 Extensional vs Intensional Clustering
9.6 Cluster Assignment
9.7 Major Clustering Approaches
9.8 Hierarchical Clustering
9.8.1 Dendrogram
9.8.2 Bottom up Hierarchical Clustering
9.9 Distance between two clusters

10 Lecture 10:
10.1 Reinforcement learning
10.2 Optimal Policy
10.3 Q-learning Algorithm
10.3.1 Q-Learning Intuition
10.3.2 Learning the Q-Values
10.3.3 Q-Learning Optimality
10.3.4 Accelerating the Q-Learning Process
10.3.5 Q-Learning Summary
10.4 Online Learning and SARSA
10.5 Expectation Maximization
1 Lecture 1: Version Spaces
Version space learning is a logical approach to machine learning, specifically binary classification. Version space
learning algorithms search a predefined space of hypotheses, viewed as a set of logical sentences. Formally, the
hypothesis space is a disjunction:
• H1 ∨ H2 ∨ ... ∨ Hn

(i.e., either hypothesis 1 is true, or hypothesis 2, or any subset of the hypotheses 1 through n). A version space
learning algorithm is presented with examples, which it will use to restrict its hypothesis space; for each example x,
the hypotheses that are inconsistent with x are removed from the space. This iterative refining of the hypothesis space
is called the candidate elimination algorithm (see 1.8); the hypothesis space maintained inside the algorithm is called its
version space.

Overview
• Classification Task
• FindS algorithm

• Version Spaces
• List Elimination Algorithm
• Boundary Sets and Candidate Elimination Algorithm
• Properties of Version Spaces

• Inductive Bias
• Version Spaces and Consistency Tests
• Volume Extension and k-Version Spaces

1.1 Classification Task


• A class is a set of objects with the same appearance, structure, or function.
• Elements are aspects of one (or more) objects.

• Classifiers are a set of elements that indicate that an object belongs to a certain class.
• The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be
returned by it (as being true).

So a classification task consists of four components: X, Y, H, and D, where:
• X := the instance space (the set of objects to be classified)
• Y := the set of classes; a hypothesis in H maps an object in X to a class in Y
• H := the hypothesis space
• D := the training data
Binary classification task: |Y| = 2. Multi-class classification task: |Y| > 2.

1.2 Learning Classifiers


Essentially a search in the hypothesis space where the goal is to find a hypothesis that best fits the training data D. If this hypothesis is consistent with a sufficiently large set of training data, it will give a good approximation of other unobserved instances.
Consistency criterion: hypothesis h is consistent with D ⇔ h(x) = y for each instance (x, y) in D.
When ordering hypotheses from "general" to "specific" the following applies (h1 is more general than or equal to h2 iff every instance that h2 classifies as positive is also classified as positive by h1):
(∀h1, h2 ∈ H)((h1 ≥ h2) ⇔ (∀x ∈ X)(h2(x) = 1 ⇒ h1(x) = 1))

1.3 Conjunction of Discrete Attributes
How to generalize a hypothesis (h) with respect to an instance (x)?
For every attribute Ai in the hypothesis h where Ai is specified and contradicts the instance x: Set Ai of h to ?
(unspecified).
And how do we make it more specific? First we create an empty set that we call the specializations. Assuming
that the instance x is a positive object;
For every attribute value v of Ai of h that is not specified (=?) we create a specialization s that is equal to h and
set the value of attribute Ai of s to v. We then set the specializations set to be the union of itself and s. (end for)

1.4 FindS Algorithm


Initialize s to the most specific hypothesis in H.
For every training instance x, check whether x is positive; if so, generalize s against x (as in Section 1.3). If x is negative, check whether s(x) = 1 (i.e. s classifies it as positive); if that is the case, stop.
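A minimal sketch of this procedure in Python, assuming instances are tuples of discrete attribute values and a conjunctive hypothesis is a list in which "?" matches anything (the toy weather data is made up for the example):

def matches(h, x):
    # A conjunctive hypothesis matches an instance if every specified attribute agrees ("?" matches anything).
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def find_s(examples):
    # examples: list of (x, y) pairs; x is a tuple of attribute values, y is True/False.
    s = None  # the most specific hypothesis: covers nothing at all
    for x, y in examples:
        if y:
            # Positive instance: minimally generalize s so that it covers x.
            s = list(x) if s is None else [hv if hv == xv else "?" for hv, xv in zip(s, x)]
        elif s is not None and matches(s, x):
            break  # a covered negative instance: the data cannot be fit by a single conjunction
    return s

data = [(("sunny", "warm"), True), (("rainy", "cold"), False), (("sunny", "hot"), True)]
print(find_s(data))  # ['sunny', '?']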

1.5 Version Spaces


Definition: The version space VS(D) for the training data D is the set of all the consistent hypotheses in H. or in
mathematical notation:
V S(D) = {h ∈ H|consistent(h, D)}
The classification rule of a version space is the unanimous-voting rule, i.e. all hypotheses in VS(D) must agree on the class of an object x in order for x to be classified; otherwise x is left unclassified.

1.6 List elimination Algorithm


More commonly known as the "List-then-eliminate algorithm". It takes a list of all the hypotheses in H and, for every training instance, removes every hypothesis from the list that is not consistent with that instance.
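A minimal sketch of the list-then-eliminate idea, assuming each hypothesis is given as a callable h(x) returning a label (the tiny threshold hypothesis space is made up for the example):

def list_then_eliminate(hypotheses, examples):
    # Keep only the hypotheses consistent with every (x, y) example: the version space.
    version_space = list(hypotheses)
    for x, y in examples:
        version_space = [h for h in version_space if h(x) == y]
    return version_space

# Usage: a tiny hypothesis space of threshold classifiers on one numeric attribute.
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
examples = [(1, False), (3, True)]
survivors = list_then_eliminate(hypotheses, examples)
print(len(survivors))  # 2 hypotheses (thresholds 2 and 3) remain consistent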

1.7 Boundary Sets


Two types:

• Minimal Boundary Set (Most specific Set)


• Maximal Boundary Set (Most General Set)
In essence this means that if a hypothesis h in H is consistent with (fits) the training data D, then there exists a hypothesis s in the minimal boundary set and a hypothesis g in the maximal boundary set such that s ≤ h ≤ g,
or more formally: (∀h ∈ H)((h ∈ V S(D)) ⇐⇒ (∃s ∈ S(D))(∃g ∈ G(D))(s ≤ h ≤ g)).

1.8 Candidate Elimination Algorithm


The candidate elimination algorithm incrementally builds the version space given a hypothesis space H and a set E
of examples. The examples are added one by one; each example possibly shrinks the version space by removing the
hypotheses that are inconsistent with the example. The candidate elimination algorithm does this by updating the
general and specific boundary for each new example.

• Candidate Elimination Algorithm(X,Y,E,H)


– Inputs:
∗ X: set of input features, X=X1,...,Xn
∗ Y: target feature
∗ E: set of examples from which to learn
∗ H: hypothesis space
– Output:
∗ general boundary GH

∗ specific boundary SH consistent with E
– Local
∗ G: set of hypotheses in H
∗ S: set of hypotheses in H
– Let G={true}, S={false};
1. for each e ∈ E do:
(a) if (e is a positive example) then compare e to Gi−1 (of the previous example).
i. Elements of G that classify e as negative are removed from G;
ii. Each element g in Gi−1 that contradicts with the same element in example e is removed from
the new general set G for example e.
iii. Non-maximal hypotheses are removed from S;
(b) else if (e is a negative example) then compare e to S of previous example:
i. Elements of S that classify e as positive are removed from S;
ii. Each element s of Si−1 that contradicts with the same element in the negative example e goes
into a new general set G where the contradicting element is the only specific element, and all
other elements are marked with a ?. If there are multiple elements e that contradict with the
same element in S, a new general set G is made. All contradictions get their own set G with
only ?’s and the single contradicting element.
∗ Each new general set is bound to the specific contradiction of the previous S.
∗ Then we eliminate from the new S (belonging to ei ) the negative elements in e that align
with the specific set S from the previous example.
iii. Non-minimal hypotheses are removed from G.

More elaborate explanation: http://artint.info/html/ArtInt_193.html

The candidate elimination algorithm converges to a correct description if:

• there are no errors in the training data, and

• the hypothesis space H contains the classifier of the target class.

1.8.1 Picking training instances


When picking the next training instance the learner should request instances that correspond to exactly half of the
descriptions in the Version Space. Therefore the description of the target concept can be found with log2 |V S| number
of instances.

1.8.2 Unanimous-Voting rule


• Definition 1: This basically means that both (upper and lower) boundaries should agree on whether a training
instance is true or false and do not contradict the training instance. (true if the training instance is true, false
if the training instance is false).
• Definition 2: Given version space VS(D), an instance x ∈ X receives a classification VS(D)(x) defined as follows:

V S(D)(x) = y if V S(D) ≠ ∅ ∧ (∀h ∈ V S(D)) y = h(x), and V S(D)(x) = "?" otherwise.

• Definition 3: Volume V(VS(D)) of version space VS(D) is the set of all instances that are not classified by
VS(D).

1.8.3 Inductive Bias
Completeness of a version space: a version space is complete ↔ for any dataset D there exists a hypothesis h in H s.t. h is consistent with D.
Now the inductive bias of version spaces is the assumption that a version space is incomplete! So when do we speak of a correct inductive bias? That is the case when the target hypothesis t is in the hypothesis space H and the training data are noise free (all fields are known and correct). According to the internet: inductive bias = the assumption that the target concept is contained in the hypothesis space (doesn't this contradict the slides' statement above?).
However, upon reviewing this with someone we concluded that it is possible that the inductive bias is simply the set of rules that we found from inductive learning over the training data, which can then be used to classify new instances. WE THINK!

1.8.4 Unanimous Voting


Theorem: For any instance x ∈ X and class y ∈ Y :
(∀h ∈ V S(D))(h(x) = y) ↔ (∀y′ ∈ Y \ {y}) V S(D ∪ {(x, y′)}) = ∅.
In other words: every hypothesis in the version space classifies instance x as class y if and only if adding x with any other label y′ collapses the version space (makes it empty). The theorem states that the unanimous-voting rule can be implemented if we have an algorithm to test version spaces for collapse.
Unanimous voting can be used to determine whether we can correctly classify an instance.

1.8.5 Accuracy
So when can we reach 100% accuracy and when not? Well there are 3 cases:
• Case 1: Data is noise free and the hypothesis space H contains the target classifier. (100% accuracy)

• Case 2: The hypothesis space H does not contain the target classifier and thus we do not know for sure which
class the instance has.
• Case 3: The training data contains noise. Therefore we cannot be certain if we are classifying correctly.

1.9 Volume Extension Approach


The volume-extension approach is a new approach to overcome the problems with noisy training data and inexpressive hypothesis spaces. If a version space V S(I+, I−) ⊆ H misclassifies instances, the approach is to find a new hypothesis space H′ s.t. the volume of version space V S′(I+, I−) ⊆ H′ grows and blocks instance misclassifications.
Theorem: Consider hypothesis spaces H and H′ such that:
(∀D)((∃h ∈ H) consistent(h, D) ⇒ (∃h′ ∈ H′) consistent(h′, D)). Then, for any data set D: V (V S(D)) ⊆ V (V S′(D)).

1.9.1 In Practice
• Case 2: H does not contain the target classifier. The solution in this case is to add a classifier that classifies
the instance differently than the classifiers in VS(D). In other words, we extend the volume of VS(D)
• Case 3: When the datasets are noisy. The solution is again to add a classifier that classifies the instances
differently than the classifiers in VS(D) and we extend the volume of VS(D) again.

1.10 K-Version Spaces


k-Version spaces were introduced to handle noisy data. They are defined as sets of k-consistent hypotheses, i.e. hypotheses consistent with all but at most k instances. Definition 1: Given a classifier space H and training data D, the k-version space V Sk(D) is:
V Sk(D) = {h ∈ H | consistentk(h, D)},
where
consistentk(h, D) ↔ (∃Dk ⊆ D, |D \ Dk| ≤ k)(∀(x, y) ∈ Dk)(y = h(x))
Theorem: if k2 > k1 then, for any data set D:
V (V Sk1(D)) ⊆ V (V Sk2(D))

2 Lecture 2: Decision Trees
Overview
Decision Trees for Classification

• Definition
• Classification Problems for Decision Trees
• Entropy and Information Gain

• Learning Decision Trees


• Overfitting, Underfitting, and Pruning
• Validation Set vs Growing Set
• Handling Continuous-Valued Attributes

• Handling Missing Attribute Values


• Alternative Measures for Selecting Attributes
• Handling Large Data: Windowing

2.1 Decision Trees for Classification


Definition: A decision tree is a tree where:
• Each interior node tests an attribute of some data set

• Each branch corresponds to an attribute value


• Each leaf node is labeled with a class (class node) of the data

2.1.1 Now when can we use decision trees?


Each instance is described by attributes with discrete values, e.g. weather forecast = sunny or weather forecast = rainy. The classification has to happen over discrete values (true or false; yes or no; 0 or 1, etc.). Decision trees can represent disjunctive descriptions: each path in the tree is a conjunction of attribute tests, and the tree as a whole is a disjunction of these paths. If the training set contains errors or missing data, the decision tree is robust enough to deal with this.

2.1.2 Decision Tree Learning


There is a basic algorithm for learning a decision tree:
1. A ← the ”best” decision attribute for a node N.
2. Assign A as decision attribute for the node N.

3. For each value of A, create new descendant of the node N.


4. Sort training examples to leaf nodes.
5. IF training examples perfectly classified, THEN STOP.
ELSE iterate over new leaf nodes

So in short, for the chosen decision attribute (e.g. weather forecast) it creates a child node per attribute value and sorts all training instances to the matching child nodes. If everything is classified correctly (no leaf node contains both a true and a false instance at the same time), it stops; otherwise it repeats the process on the impure leaf nodes.

2.1.3 Entropy
Entropy measures the impurity of the training data:
• E(S) = −p+ log2 p+ − p− log2 p−

where S is a sample of the training data, p+ is the proportion of positive training instances and p− the proportion of negative ones. This brings us to information gain:

2.1.4 Information Gain


Information gain is the expected reduction in entropy if a certain attribute A is selected to generate the new leaf nodes. One can compute the information gain using the following formula:
• Gain(S, A) = E(S) − Σv∈Values(A) (|Sv| / |S|) · E(Sv)

where Sv = {s ∈ S | A(s) = v}, i.e. the set of all samples s in S for which attribute A takes the value v.
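As an illustration, a minimal Python sketch of the entropy and information-gain formulas above; the toy forecast data is made up for the example:

from collections import Counter
from math import log2

def entropy(labels):
    # E(S) = -sum over classes of p * log2(p), computed from the class proportions in the sample.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain(S, A) = E(S) - sum over values v of |Sv|/|S| * E(Sv).
    n = len(labels)
    gain = entropy(labels)
    for v in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["yes", "yes", "no", "yes"]
print(round(entropy(labels), 3))                    # 0.811
print(round(information_gain(rows, labels, 0), 3))  # 0.311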

2.2 ID3 Algorithm


In informal terms, the ID3 Algorithm does:

• Determine the attribute with the highest information gain on the training set.
• Use this attribute as the root, create a branch for each of the values the attribute can have.
• For each branch repeat the process with subset of the training set that is classified by that branch.
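A compact recursive sketch of this greedy loop (split on the attribute with the highest information gain, recurse on each branch); it is only an illustration with simplified tie-breaking and no pruning, and the toy data is made up:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    # Leaves are class labels; interior nodes are {attribute_index: {value: subtree}}.
    if len(set(labels)) == 1:
        return labels[0]                             # pure node: stop
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority class

    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            subset = [y for r, y in zip(rows, labels) if r[a] == v]
            g -= len(subset) / len(labels) * entropy(subset)
        return g

    best = max(attributes, key=gain)                 # attribute with the highest information gain
    children = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
        children[v] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return {best: children}

rows = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "yes"]
print(id3(rows, labels, [0, 1]))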

2.2.1 Hypothesis Space


Hypothesis space = set of all decision trees defined over given set of attributes. ID3 guarantees a complete hypothesis
space; meaning that the target description is in the hypothesis space. it basically does a simple-to-complex hill
climbing search through this space where the evaluation function is the information gain. It only expands over 1
current decision tree; meaning that it only expands over 1 node in the previous decision tree and does not backtrack.
Note that ID3 uses the entire dataset at each step of the search.

2.2.2 Inductive Bias in ID3


(From Wikipedia) The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions
that the learner uses to predict outputs given inputs that it has not encountered. In machine learning,
one aims to construct algorithms that are able to learn to predict a certain target output. To achieve this, the
learning algorithm is presented some training examples that demonstrate the intended relation of input and output
values. Then the learner is supposed to approximate the correct output, even for examples that have not been shown
during training. Without any additional assumptions, this problem cannot be solved exactly since unseen situations
might have an arbitrary output value. The kind of necessary assumptions about the nature of the target function
are subsumed in the phrase inductive bias.
A classical example of an inductive bias is Occam’s Razor, assuming that the simplest consistent hypothesis
about the target function is actually the best. Here consistent means that the hypothesis of the learner yields correct
outputs for all of the examples that have been given to the algorithm.
Approaches to a more formal definition of inductive bias are based on mathematical logic. Here, the inductive bias
is a logical formula that, together with the training data, logically entails the hypothesis generated by the learner.
Unfortunately, this strict formalism fails in many practical cases, where the inductive bias can only be given as a
rough description (e.g. in the case of neural networks), or not at all.
When ”choosing” the inductive bias from the ID3 search we have some preferences on picking it.
• we prefer short trees

• we prefer trees with high information gain attributes near the root.
Note that the bias is not a restriction on the hypothesis space but a preference to some hypotheses.

2.3 Overfitting, Underfitting, and Pruning
Overfitting is the concept where a model contains more parameters than the data can reasonably justify; in simpler terms, the model learns too much from noise and interprets noise as meaningful data. Overfit statistical models can therefore suggest things that aren't true, because they have learned too much from noise.
Overfitting generally happens when you have more adjustable parameters than would be optimal, or more simply when the model is more complicated than necessary. The model may then "learn" a specific noisy example and assume it is an important characteristic, when in fact it was merely an outlier. Overfitting can be avoided by being as general as possible, and furthermore by finding some form of average between an overfit and an underfit model.
In science the principle of Occam's Razor is the idea that the simplest solution is often the best or "most correct": do not make things more complicated than necessary. This view is also often used in machine learning, and it holds when working with decision trees as well. Big (complex) decision trees harbour the threat of overfitting: the bigger the tree, the bigger the risk of overfitting.

2.3.1 Causes of Overfitting


(from Wikipedia) Overfitting is especially likely in cases where learning was performed too long or where training
examples are rare, causing the learner to adjust to very specific random features of the training data, that have
no causal relation to the target function. In this process of overfitting, the performance on the training
examples still increases while the performance on unseen data becomes worse.
Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known
data (hindsight) but less accurate in predicting new data (foresight). One can intuitively understand overfitting from
the fact that information from all past experience can be divided into two groups: information that is relevant for the
future and irrelevant information (”noise”). Everything else being equal, the more difficult a criterion is to predict
(i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is
determining which part to ignore. A learning algorithm that can reduce the chance of fitting noise is called robust.
• Noisy training data.
• Small number of instances are associated with leaf nodes. (coincidental regularities may occur that are unrelated
to target concept).

2.3.2 Avoiding Overfitting


• Pre-pruning: Stop the tree from growing before it matches the training data perfectly.
– When to stop? (difficult) Some of the solutions:
∗ Stop splitting a node when it covers fewer than M training instances.
∗ Use a Validation Set: a set of instances used to evaluate the utility of nodes in decision trees.
Usually the training data is randomly split into a growing set and a validation set. The set must be
chosen in a manner that it is unlikely to have the same errors as the growing set. For example see
Reduced-Error Pruning further on in this document.
• Post-pruning: Allow the tree to over-fit, then tweak the tree afterwards. One can also couple overfit models with underfit models and find some form of average between the two.

2.3.3 Underfitting
Underfitting occurs when a statistical model or machine learning algorithm cannot adequately capture the underlying
structure of the data. It occurs when the model or algorithm does not fit the data enough. Underfitting occurs if the
model or algorithm shows low variance but high bias (to contrast the opposite, overfitting from high variance and
low bias). It is often a result of an excessively simple model.

2.3.4 Identifying Overfitness, Underfitness, and Optimality


• Overfitness:
– When performance on training data increases while performance on unseen data/testing data decreases.
The training data is being learned while the unseen data is being misclassified. On a graph it can also be
identified by a wide gap between the training data’s accuracy vs the testing data accuracy.

• Underfitness:
– When performance is poor (error is high, accuracy is low) on both the training AND unseen/testing data.
The model is too generic, and it is not learning enough, leading to poor performance all around. Can be
identified on a graph by seeing low accuracy rates for both sets of data.

• Optimality:
– When performance on both training data and the unseen/testing data follows a very similar pattern,
meaning that something that is affecting the training data is also affecting unseen data, leading to the
conclusion that something else besides model fitness is at play.

2.3.5 Growing Set vs Validation Set


In making a decision tree, we can split the data into two sets: the Growing Set and the Validation Set. When
creating these sets, we (randomly or via some heuristic) remove some examples from the overall data set and put
them into a validation set. We then use the remaining examples as the growing set. The validation set is evaluated
and used as a metric to inform the model when constructing predictions for future models.
When the validation set is of a sufficient size (dependent on the specific model, difficult to generalize here) we
can get sufficient results from the decision tree. However, there are a few things to take into account.

1. As the validation set grows, the growing set shrinks, and vice-versa.
2. If the validation set is too small, it can make extremely general inferences on the data it contains, which it
then uses to inform the decision tree which can lead to an overly-pruned and too-small decision tree.
3. If the validation set is too large, it can lead to an under-pruned and too-large decision tree, leading to inefficiency
when making the decision tree.
4. The size of the validation set is subjective relative to the data, and is often best ”played around with” in order
to generate the most efficient results, measured by other metrics such as relative error rates.

2.3.6 Reduced-Error Pruning


One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with
its most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive,
reduced error pruning has the advantage of simplicity and speed.
• Sub-tree replacement So for pruning a decision node d we do the following:

1. Remove the sub-tree that has node d as root.


2. d is a leaf node now.
3. assign d the most common classification of the training instances associated with d. I.e. see if it is more
likely that at this point the class is true or false and use that as the new leaf node.

We do the above until further pruning is harmful: Evaluate impact on validation set for each node that can be
pruned and remove the sub-tree that most improves validation set accuracy.
• Sub-tree raising
1. Remove the sub-tree that has the parent of node d as root.
2. Place d at the place of its parent
3. Sort the training instances associated with the parent of d using the sub-tree with root d.
Then again evaluate if the accuracy of the tree on the validation set has increased.

2.3.7 Rule Post-Pruning
1. Convert tree to equivalent set of rules.
2. Prune each rule independently of others.

3. Sort final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent
instances.

So for converting into rules we do the following: Start at the root node; for every path to a leaf node we create a
rule using AND operators. Then for every rule try to prune it independently (see if you can achieve higher accuracy
by removing conditions in the rule).

2.3.8 Impurity
Impurity: the diversity of training instances. A high impurity means that there is an equal number of instances of every class; a low impurity means that every instance is of the same class. More formally we can describe impurity as follows: let S be a sample of training instances and pj the proportion of instances of class j (j = 1, ..., J) in S. An impurity measure I(S) must satisfy the following:
• I(S) is minimal only when pi = 1 and pj = 0 for j ≠ i (all objects are of the same class)
• I(S) is maximal only when pj = 1/J for all j (there is exactly the same number of objects of all classes)
• I(S) is symmetric with respect to p1, ..., pJ

2.3.9 Reduction of impurity


Basically the best split is the split that is expected to decrease the impurity the most. This expected decrease in impurity can be calculated as follows:
∆I(S, A) = I(S) − Σa (|Sa| / |S|) · I(Sa)
where Sa is the subset of objects from S s.t. A = a. ∆I is called a score measure or a splitting criterion.

2.3.10 Gini Index


Another way of measuring impurity is the Gini index. It measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset, i.e.
I(S) = Σj pj (1 − pj)
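A minimal sketch of the Gini index computed from a list of class labels:

from collections import Counter

def gini(labels):
    # Gini index: sum over classes of p_j * (1 - p_j), from the class proportions in the sample.
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))   # 0.5: maximal impurity for two balanced classes
print(gini(["yes", "yes", "yes"]))        # 0.0: a pure sample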

2.4 Dealing with continuous attributes


2 solutions:
1. Pre-discretize, e.g. Cold if temp < 10 degrees Celsius.
2. Discretize during tree growing
Now the problem is to find out where to make the ”cut point” during discretization. We cut at the point with
the highest information gain (highest impurity decrease (∆I))
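A minimal sketch of choosing such a cut point: candidate thresholds are placed between consecutive sorted values and the one with the highest entropy-based impurity decrease is kept (the tiny temperature data is made up):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    # Return (threshold, gain) maximizing the impurity decrease of the split "value < threshold".
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                             # candidate threshold between consecutive values
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = base - (len(left) / len(pairs)) * entropy(left) - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

temps = [5, 8, 12, 20, 25]
play = ["no", "no", "yes", "yes", "yes"]
print(best_cut_point(temps, play))                    # cutting at 10.0 separates the classes perfectly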

2.5 Oblique Decision Trees


Rather than testing just 1 attribute some test conditions may involve multiple attributes. This allows more expressive
representation. However finding the optimal test condition is computationally expensive.

2.6 Attributes with Many Values
If attributes have a lot of values this poses 2 problems:
1. No good splits: they fragment the data too quickly, leaving insufficient data at the next level.

2. A misleadingly high reduction of impurity: the gain criterion tends to prefer attributes with many values.


However we also have 2 solutions:
1. Add a penalty to attributes with many values when applying the splitting criterion.
2. Consider only binary splits.

2.6.1 Gain Ratio


One way of applying such a penalty is the gain ratio: GainRatio(S, A) = InfoGain(S, A) / SplitInformation(S, A). But this method is not flawless; the gain ratio favours unbalanced tests.

2.7 Missing Attribute Values


Another problem that we will come across is missing attribute values. There are a few strategies to deal with this:
• Assign the most common value of A among other instances belonging to the same concept.

• If node n tests the attribute A, assign most common value of A among other instances sorted to node n.
• If node n tests the attribute A, assign a probability to each of the possible values of A. These probabilities are estimated based on the observed frequencies of the values of A among the instances sorted to node n, and they are used in the information gain measure via the term Σv∈Values(A) (|Sv| / |S|) · E(Sv).

2.8 Windowing
Lastly if we don’t have enough memory to fit all the training data in we can use a technique named windowing:
1. Select randomly n instances from the training data D and put them in window set W.
2. Train a decision tree DT on W.

3. Determine a set M of instances from D misclassified by DT.


4. W = W ∪ M
5. IF Not(StopCondition) THEN Go to 2;
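A minimal sketch of this windowing loop, written against a placeholder train(rows, labels) callable that returns a model with a predict method; the stop condition used here (no new misclassified instances) is an assumption:

import random

def windowing(data, labels, train, n=100, max_rounds=10):
    # data/labels: the full training set; train(rows, ys) returns a model with a .predict(rows) method.
    window = set(random.sample(range(len(data)), min(n, len(data))))   # 1. random initial window W
    model = None
    for _ in range(max_rounds):
        rows = [data[i] for i in window]
        ys = [labels[i] for i in window]
        model = train(rows, ys)                                        # 2. train a model on W
        predictions = model.predict(data)
        missed = {i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y}   # 3. set M of misclassified instances
        if missed <= window:                                           # assumed stop condition: no new mistakes outside W
            break
        window |= missed                                               # 4. W = W ∪ M
    return model

# Usage with a trivial majority-class model standing in for the decision tree learner.
class Majority:
    def fit(self, rows, ys):
        self.label = max(set(ys), key=ys.count)
        return self
    def predict(self, rows):
        return [self.label] * len(rows)

model = windowing(list(range(20)), ["a"] * 15 + ["b"] * 5, lambda r, y: Majority().fit(r, y), n=5)
print(model.predict([0]))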

3 Lecture 3: Evaluation of Learning Models
Overview
• Motivation
• Metrics for Classifier’s Evaluation
• Methods for Classifier’s Evaluation
• Comparing Data Mining Schemes
• Costs in Data Mining
– Cost-Sensitive Classification and Learning
– Lift Charts
– ROC Curves

3.1 Motivation
Why evaluate a classifier's generalization performance (i.e. how good the classifier is in practice)?
• To determine whether to employ the classifier: when using a limited data set for training, we need to know how accurate the classifier is in order to decide whether we can deploy it.
• For optimization purposes, e.g. when post-pruning, the accuracy must be determined at every pruning step.

3.2 Evaluation of Classifier Performance


3.2.1 Confusion Matrix
Basically a matrix that visualises the correctly and incorrectly classified instances. It makes a distinction between true positives and true negatives (both correct) and false positives and false negatives (both incorrect):

                     Predicted Positive    Predicted Negative
Actual Positive      True Positive         False Negative
Actual Negative      False Positive        True Negative

3.2.2 Metrics
There are various metrics to evaluate a classifier:
• Accuracy = (TP + TN) / (P + N) = ratio of correctly classified instances
• Error = (FP + FN) / (P + N) = ratio of incorrectly classified instances
• Precision = TP / (TP + FP) = ratio of positively classified instances that are truly positive
• Recall / TP rate (TPR) = TP / P = ratio of positive instances that are correctly classified
• FP rate (FPR) = FP / N = ratio of negative instances that are incorrectly classified
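A minimal sketch computing these metrics from actual and predicted binary labels:

def classification_metrics(actual, predicted):
    # actual/predicted: lists of booleans (True = positive class).
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    pos, neg, total = tp + fn, tn + fp, len(actual)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall_tpr": tp / pos if pos else 0.0,
        "fpr": fp / neg if neg else 0.0,
    }

print(classification_metrics([True, True, False, False], [True, False, True, False]))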
So to which data can we apply these metrics? Before we start we need to define stratification: when stratifying data, make sure that each class is represented with approximately equal proportions in each subset. This is a more advanced version of balancing the data.
• Training Data (Not a good indicator because training data are not a good performance indicator for future
data)
• Independent test data (Requires plenty of data and a natural way to forming training and test data)
• Hold-out method (Data is split in training and test data usually 2/3 and 1/3 respectively. However if the data
is unbalanced samples may not be representative, e.g. few or no instances of a certain class)

• Repeated hold-out method (More reliable than regular hold-out method due to the fact that it repeats the
process with randomly selected different sub-samples possibly with stratification. But this method does not
avoid overlapping test data nor does it guarantee that all instances are used at least once)
• k-fold cross-validation method (Split data into k equally sized stratified subsets then each subset is used for
testing and the remainder for training. The metric estimates are averaged to yield an overall estimate. Standard
method = 10-fold stratified cross-validation. 10-fold gives best results, stratification reduces estimate’s variance.
Further improvement: Repeated 10-fold stratified cross-validation reduces the estimate’s variance even further)
• Leave-one-out cross-validation (number of folds = number of training instances. Makes best use of the data
BUT computationally expensive. Involves no random sub-sampling. Does not allow stratification. Worst case
scenario: data set split equally into 2 classes: 50% accurate on fresh data but estimated error is 100%)

• Bootstrap method, aka 0.632 bootstrap (like cross-validation, but with replacement. Idea: take n samples (of size 1) from a dataset with replacement to create a training set. Instances from the original dataset that don't occur in the new training set are used for testing. The probability of an instance ending up in the test data is e^−1 ≈ 0.368, i.e. test data ≈ 36.8% of instances ⇔ training data ≈ 63.2%. It requires a special error estimate: error = 0.632 · e_test + 0.368 · e_train, where e_x is the error measured on subset x. Repeat the process several times with different replacement samples and average the results.)
• And many more
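A minimal sketch of the k-fold cross-validation procedure from the list above (without stratification), written against a placeholder train_and_score(train_set, test_set) callable:

def k_fold_cross_validation(data, k, train_and_score):
    # Split data into k folds; each fold serves once as the test set; average the k scores.
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train_set, test_set))
    return sum(scores) / k

# Usage with a dummy scorer that just reports the test-fold fraction of the data.
print(k_fold_cross_validation(list(range(10)), 5, lambda tr, te: len(te) / (len(tr) + len(te))))  # 0.2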

3.2.3 Confidence Intervals for Estimates on Classification Performance


If the test data contains more than 30 examples drawn independently of each other, then with approximately N% probability, errorD(h) lies in the interval

errorS(h) ± ZN · sqrt( errorS(h) · (1 − errorS(h)) / n )

where errorS(h) is the estimated error, errorD(h) is the actual error, and ZN is taken from:

N%: 50%   68%   80%   90%   95%   98%   99%
ZN: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
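A minimal sketch of the interval computation using the ZN table above:

from math import sqrt

Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(sample_error, n, confidence=95):
    # Approximate N% interval for the true error; valid for n > 30 independently drawn test examples.
    half_width = Z_N[confidence] * sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width

print(error_confidence_interval(0.10, 100, confidence=95))  # roughly (0.041, 0.159)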

3.2.4 Metric Evaluation TL;DR


Favourable evaluation method by data size: large data sets → hold-out test sets; medium → cross-validation; small → leave-one-out or bootstrap.
Also, don't use test data for parameter tuning; use separate validation data instead.

3.3 Comparing Data-Mining Classifiers


Intuition says: train and test using cross-validation or the bootstrap and rank classifiers according to performance. However, we don't make things easy, do we?

3.3.1 Counting the Costs


Different classification errors come at different costs, e.g. terrorist profiling, loan decisions, etc. In some cases you prefer false positives, in other cases you prefer false negatives. From this one can create a so-called cost matrix (the costs of TP and TN are usually set to 0):

                       Actual Positive    Actual Negative
Hypothesis Positive    TP cost            FP cost
Hypothesis Negative    FN cost            TN cost

Now we can talk about cost-sensitive classification.

3.3.2 Cost-Sensitive Classification


If a classifier outputs probabilities for each class, we can adjust it to minimize the expected costs of the predictions.
Meaning that if we falsely classify we do it at the least possible cost.
The Expected cost is computed as the dot product of the vector of class probabilities and the appropriate column
in the cost matrix.
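A minimal sketch of choosing the prediction with the lowest expected cost, computed as the dot product described above (the example probabilities and costs are made up):

def min_expected_cost_prediction(class_probabilities, cost_matrix):
    # class_probabilities: dict class -> probability; cost_matrix[predicted][actual] -> cost of that outcome.
    expected = {
        predicted: sum(class_probabilities[actual] * costs[actual] for actual in class_probabilities)
        for predicted, costs in cost_matrix.items()
    }
    return min(expected, key=expected.get), expected

# Usage: predicting negative while the case is actually positive is made ten times as costly.
probabilities = {"pos": 0.2, "neg": 0.8}
costs = {"pos": {"pos": 0, "neg": 1}, "neg": {"pos": 10, "neg": 0}}
print(min_expected_cost_prediction(probabilities, costs))  # predicting 'pos' has the lower expected cost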
There are some simple methods for cost sensitive learning:

• Re-sampling of instances according to costs
• Weighting of instances according to costs.

3.4 Lift Charts


In practice decisions are made by comparing possible scenarios and taking into account different costs. In order to deal with this we generate lift charts.

3.4.1 Generating a Lift Chart


What we do is sort the instances by their predicted probability of being positive. Then we can draw a graph with the sample size on the x-axis and the number of true positives on the y-axis.

3.5 ROC Curves


An ROC curve describes the rates of True Positive Rate (TPR) (y-axis) versus the False Positive Rate (FPR) (x-axis).
With this information you can also extract the rates of False negative (1-y) and true negative (1-x). A convex curve
means that there is a good separation between classes. Concavities indicate that there is poor separation between
the classes.
ROC curves and lift charts can be used for internal optimization of classifiers.
A classifier A dominates a classifier B ⇔ TPR_A > TPR_B ∧ FPR_A < FPR_B.
If a classifier lies on the diagonal of the ROC space (e.g. TPR = FPR = 0.5), this means that TPR = FPR. In this case, assuming P = N, the accuracy is:
• (TPR · P + TNR · N) / (P + N)
• = (TPR · P + (1 − FPR) · N) / (P + N)   (because TNR = 1 − FPR)
• = (TPR · P + (1 − TPR) · N) / (P + N)   (because FPR = TPR in this case)
• = (TPR · (P − N) + N) / (P + N)
• = N / (P + N) = 1/2   (because TPR · (P − N) = 0 when P = N)

3.5.1 ROC Convex Hull


Also denoted as ROCCH and is determined by the dominant classifiers. Classifiers that are on the ROCCH achieve
the best accuracy and classifiers below the ROCCH are always sub-optimal. Any performance on a line segment
connecting two ROC points can be achieved by randomly choosing between them. The classifiers on ROCCH can be
combined to form a hybrid.

3.5.2 Iso-Accuracy Lines


Iso-accuracy lines are lines that denote the same accuracy over the ROC space: if such a line connects two ROC points, those points have the same accuracy. Iso-accuracy lines have slope N/P. Higher iso-accuracy lines are better (higher as in higher accuracy / true positive rate).

3.5.3 Constructing ROC Curve for 1 Classifier


1. Sort instances on probability of being positive
2. move a threshold on the sorted instances.
3. For each threshold define a classifier with confusion matrix.
4. Plot the True positive rate and the False positive rate of the classifiers.
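A minimal sketch of this threshold-sweeping construction; each threshold yields one (FPR, TPR) point (the example scores are made up):

def roc_points(scores, labels):
    # scores: predicted probability of being positive; labels: booleans. Returns (FPR, TPR) points.
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sweep the threshold downward through the instances sorted by score (steps 1 and 2).
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))       # steps 3 and 4: one ROC point per threshold
    return points

print(roc_points([0.9, 0.8, 0.4, 0.3], [True, False, True, False]))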

3.5.4 Area Under Curve Metric (AUC)


The area under the curve assesses the separation of the classes. A high area under the ROC curve means that there is a good separation. The area under the curve estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.

4 Lecture 4: Bayesian Learning
4.1 Introduction
• Each observed training instance can incrementally decrease or increase the estimated probability that a hy-
pothesis is correct.
• Prior knowledge is combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods accommodate hypotheses that make probabilistic predictions (e.g. 93% chance of recovery)
• Instances are classified by combining predictions of multiple hypotheses, weighted by their probabilities.
• Requires initial knowledge of many probabilities.
• High computational cost.
• Is a standard for optimal learning.

4.2 Bayes Theorem


Goal: Determine the final probability of hypothesis h given the data D from:
• Prior probability of h, P(h): background knowledge about chance that h is correct regardless of observed
data.
• Prior probability of D, P(D): probability that training data D will be observed without knowledge about
which hypothesis h holds.
• Conditional Probability of observation D, P (D | h): probability of observing data D given some world
in which hypothesis h holds.
Now our goal is the posterior probability of h, P(h | D), i.e. the probability that h holds given training data D.
Bayes' theorem allows us to compute P(h | D):
P(h | D) = P(D | h) · P(h) / P(D)

4.3 Maximum a Posteriori Hypothesis (MAP)


The Maximum a Posteriori Hypothesis is the most probable hypothesis. i.e. the hypothesis h in the hypothesis space
that has the highest P (h | D).

4.4 Useful Formulas


• Product Rule: P (A ∧ B) = P (A | B)P (B) = P (B | A)P (A)
• Disjunction Rule: P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
• Theorem of Total Probability: P(B) = Σ(i=1..n) P(B | Ai) · P(Ai)

4.5 Brute Force MAP hypothesis learner


Boils down to: Calculate posterior probability (P (h | D)) for every hypothesis. Then pick the hypothesis with the
highest probability.
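A minimal sketch of the brute-force MAP learner, assuming a noise-free setting in which P(D | h) is 1 if h is consistent with D and 0 otherwise (the tiny hypothesis space is made up):

def brute_force_map(hypotheses, priors, data):
    # hypotheses: callables h(x) -> label; priors: list of P(h); data: list of (x, y) pairs.
    def likelihood(h):
        return 1.0 if all(h(x) == y for x, y in data) else 0.0          # P(D | h), assuming noise-free data
    scores = [likelihood(h) * p for h, p in zip(hypotheses, priors)]     # proportional to the posterior P(h | D)
    best = max(range(len(hypotheses)), key=lambda i: scores[i])
    return hypotheses[best], scores[best]

hypotheses = [lambda x: x > 1, lambda x: x > 3]
h_map, score = brute_force_map(hypotheses, [0.5, 0.5], [(2, True), (0, False)])
print(score)  # 0.5: only the first hypothesis is consistent with the data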

4.6 Minimum Description Length Principle


This is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to
the best compression of the data.
Given the data, this principle picks the hypothesis that maximizes the product P(D | h) · P(h):
• hMAP = argmax P(D | h) · P(h)
• = argmax (log2 P(D | h) + log2 P(h))
• = argmin (−log2 P(D | h) − log2 P(h))

4.7 Bayes Optimal Classifier
Another problem is the following: Given data D, hypothesis space H, and a new instance x, what is the most probable
classification of x? It is not the most probable hypothesis in H. The Bayes optimal classifier assigns to an instance
the classification cj that has the maximum posterior probability P (cj | D). Now the maximum posterior probability
P (cj | D) is calculated using the theorem for total probability. It is calculated using all the hypotheses weighted by
their posterior probabilities w.r.t. the data D:
v_OB = argmax_{c_j ∈ {+,−}} P(c_j | D) = argmax_{c_j ∈ {+,−}} Σ_{h_i ∈ H} P(c_j | h_i) P(h_i | D)
It is the best classification method in terms of its average accuracy. However, the Bayes optimal classifier may not be in
the hypothesis space!

4.8 Gibbs Classifier


1. Choose hypothesis at random according to P (h | D)

2. Use this hypothesis to classify new instance

Expected error: E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

4.9 Naı̈ve Bayes Classifier


Given attributes a ∈ A and class values v ∈ V, pick the class value v_j with maximum probability:

• v_MAP = argmax_{v_j} P(v_j) Π_i P(a_i | v_j)

It assumes that attributes are conditionally independent!

To estimate the probability P (A = v | C) of an attribute-value A = v for a given class C we use:


• Relative frequency: n_c / n, where n_c is the number of instances that belong to class C and have value v for attribute A, and n is the number of training instances of class C.
• M-estimate: (n_c + mp) / (n + m), where p is a prior estimate of P(A = v | C) and m is the weight given to p.
We take the normalized probabilities of the outcomes of the above, and the instance is classified as the class value with the higher probability.
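A rough sketch of training and prediction with the m-estimate (representing examples as (attribute-dict, class) pairs and using a uniform prior p = 1/k are illustrative assumptions, not the lecture's exact formulation):

from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    # examples: list of (attributes, class_label) pairs, attributes being a dict
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)            # (class, attribute) -> Counter over values
    for attrs, c in examples:
        for a, v in attrs.items():
            value_counts[(c, a)][v] += 1
    return class_counts, value_counts, len(examples)

def predict(attrs, class_counts, value_counts, n_total, m=1.0):
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / n_total)                    # log P(C)
        for a, v in attrs.items():
            n_cv = value_counts[(c, a)][v]                 # instances of C with A = v
            k = len(value_counts[(c, a)]) or 1             # observed number of values of A (assumption)
            p = 1.0 / k                                    # uniform prior p for the m-estimate
            score += math.log((n_cv + m * p) / (n_c + m))  # log of the m-estimate of P(A = v | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class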

5 Lecture 5: Linear Regression
TL;DR
Linear regression is the act of trying to fit a function that predicts an output y from an input vector X, using the training values x ∈ X to capture the patterns in the data. We usually do this by minimizing the least-squares error between the data points and the
approximating function, or by minimizing a penalized version of the least-squares loss function.

5.1 Supervised Learning: Regression


Linear regression models the relationship between a scalar dependent variable y and one or more explanatory variables
(or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For
more than one explanatory variable, the process is called multiple linear regression.
In linear regression, the relationships are modeled using linear predictor functions whose unknown model param-
eters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of y
given the value of X is assumed to be an affine function of X ; less commonly, the median or some other quantile of
the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis,
linear regression focuses on the conditional probability distribution of y given X .

5.1.1 Regression versus Classification


When do we consider a problem a classification or a regression problem? A classification problem identifies discrete cases (true or false, 0 or 1), whereas a regression problem predicts (continuous) amounts/values.

5.2 Linear Regression


Given a training set of values (vector) X, apply a learning algorithm and try to learn a hypothesis h, represented as
a linear function where h is a function that maps x values to y results:
• y = hΘ(x) = Θ0 x0 + Θ1 x1 + ... + Θn xn for every input variable x_i, where 0 ≤ i ≤ n (and x0 = 1 by convention).
But how do we calculate the parameters Θ?

5.3 Cost function intuition


We want to choose Θ0 , Θ1 such that hΘ (x) is close to y for our training examples (x, y). The idea behind the cost
function is that we want to minimize the total distance between the (estimation) line and the training data. When
minimizing the cost, we often normalize by m so that we can view the cost function as an approximation of the
”generalization error,” or the expected square loss on a randomly chosen new example. Put more simply, we are
minimizing the error rate instead of the total error. For models with 1 variable:

• Hypothesis:
– hΘ (x) = Θ0 + Θ1 x
• Parameters:
– Θ0 , Θ1
• Cost Function J(Θ0, Θ1):
– J(Θ0, Θ1) = (1/2m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2, where
– hΘ(x^(i)) − y^(i)
is the difference between the calculated result and the actual test data. To find optimal values for the parameters Θ0 and Θ1 we want to minimize this difference between the calculated results and the actual results of
our test data.
We attach the coefficient 1/2 so that the exponent 2 of the square does not clutter the resulting derivative (the two cancel). We also
divide by the number of summands m to get the average cost per data point.

The error measure in the cost function is a ”statistical distance”; in contrast to the popular and preliminary
understanding of distance between two vectors in Euclidean space. With statistical distance we are attempting to
map the ”dis-similarity” between estimated model and optimal model to Euclidean space.
There is no constricting rule regarding the formulation of this statistical distance, but if the choice is appropriate
then a progressive reduction in this ’distance’ during optimization translates to a progressively improving model
estimation. Consequently, the choice of ’statistical distance’ or error measure is related to the underlying data
distribution.

5.3.1 Least Squares Error
Given a collection of data points (x_i, y_i) and a hypothesis h for some Θ, the least-squares error of h on a single data point (x_i, y_i) is:
• (hΘ(x_i) − y_i)^2
If we sum up the errors over all data points, we multiply by 1/2 to prevent the exponent 2 of the square from having an effect on the derivative, resulting in the total error:
• (1/2) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2
We also divide the total error by the number of summands m to get the average error per data point, giving
us the resulting coefficient of 1/(2m):
• (1/2m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2

When comparing performance on two data sets of different size, the raw sum of squared errors are not directly
comparable because larger data sets tend to lead to higher error totals. When you normalize, you can compare
the average error per data point.

5.4 Gradient descent


Gradient descent is a very well-known algorithm for finding maxima and minima; however, it can get stuck in local
minima. It is used in all sorts of optimization problems, not just regression. It is relatively simple compared
to other more sophisticated techniques, yet is still useful.
Gradient Descent is an iterative algorithm for finding max/min of a function:
1. Start with some Θ0 , Θ1
2. Keep updating Θ0 , Θ1 to reduce J(Θ0 , Θ1 ) until you (hopefully) reach a minimum.
Mathematically:
1. hΘ(x) = Θ0 + Θ1 x
2. J(Θ0, Θ1) = (1/2m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2
3. Θj := Θj − α ∂/∂Θj J(Θ0, Θ1) for j = 0, 1 (repeat until convergence, updating Θ0 and Θ1 simultaneously)

where m is the no. of data points and α is the learning rate. (Usually pre-defined)
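A minimal sketch of this procedure for simple (one-variable) linear regression, assuming hypothetical lists x and y of training values:

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        h = [theta0 + theta1 * xi for xi in x]                           # current predictions
        grad0 = sum(h[i] - y[i] for i in range(m)) / m                   # dJ/dTheta0
        grad1 = sum((h[i] - y[i]) * x[i] for i in range(m)) / m          # dJ/dTheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1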

5.4.1 Choosing Learning Rate


We don’t want the learning rate α to be too small or too big:
• Too small: Slow convergence
• Too big: the gradient step may overshoot the minimum (and thus we may fail to converge, or even diverge)

5.4.2 Multiple Features


Gradient descent can also be used for multivariate linear regression, where the hypothesis is:
• hΘ(x) = Θ0 + Θ1 x1 + Θ2 x2 + ... + Θn xn
and the cost function J(Θ) is defined as before. The gradient descent algorithm then looks like this:
Repeat until converged:
1. Θj := Θj − α (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i)

NOTE: Simultaneously update every Θj ! Only after updating ALL Θ’s should you update hΘ (x)!

5.5 Normal Equation
5.5.1 Feature Scaling
With feature scaling we get all features into the [−1, 1] range. Basically, we standardize the range of the
independent variables (features) of the data, because scaling ensures that features with large values do not dominate
as the main predictor. This can noticeably improve the performance of the gradient descent algorithm. An alternative that
needs no feature scaling or iteration is the normal equation, described next.

5.5.2 The Algorithm


The normal equation is derived as follows:
• we want to minimize over Θ: (1/2)[XΘ − y]^T [XΘ − y], whose gradient with respect to Θ is proportional to:
• X^T X Θ − X^T y; setting this gradient to zero gives:
• X^T X Θ = X^T y, from which it follows that:
• Θ = (X^T X)^{-1} X^T y (note: the −1 denotes matrix inversion here).
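A small NumPy sketch of this closed-form solution (assuming X already contains a leading column of ones for the intercept):

import numpy as np

def normal_equation(X, y):
    # Theta = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)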

5.6 Normal Equation vs Gradient Descent


• Gradient Descent
– Need to choose α
– needs many iterations
– works well even when the number of features is large
• Normal Equation
– No need for α
– No need to iterate
– Needs to compute (X T X)−1
∗ O(n3 )
∗ might be non-invertible

5.7 Finding the ”right” model


There are two problems that we may face: overfitting and underfitting. These can be addressed by one of the following:
1. Reducing the number of features.
• Manually select which features to keep
• Model selection algorithm
2. Regularization
• Keeps all the features but reduces the magnitude of parameters Θj .
• Works well when we have a lot of features, each of which contributes a bit to predicting y.

5.7.1 Regularization
When applying regularization we alter the cost function into the following:
• J(Θ) = (1/2m) [ Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2 + λ Σ_{j=1}^{n} Θ_j^2 ], where the regularization term we add is λ Σ_{j=1}^{n} Θ_j^2.

Regularization parameter λ is an input parameter to the model. Lambda can be selected by sub-sampling the
data and finding the variation. The value of lambda can reduce overfitting as it increases, however it does
this at the expense of greater bias.

For the gradient descent algorithm it would look as follows:
• Θj := Θj − α [ (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i) + (λ/m) Θj ], which can be rewritten as
• Θj := Θj (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i)

For the normal equation:
• Θ = (X^T X + λ L)^{-1} X^T y, where L is the (n+1) × (n+1) diagonal matrix diag(0, 1, 1, ..., 1), i.e. an identity matrix whose top-left entry is 0 so that the intercept term is not regularized.
Two advantages:

1. Fights over-fitting
2. Guarantees matrix of full rank, and thus invertible
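A small NumPy sketch of this regularized variant under the same assumption as before (L is the identity matrix with a zero in its top-left entry, so the intercept is not penalized):

import numpy as np

def regularized_normal_equation(X, y, lam):
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                                       # do not regularize the bias term
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)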

6 Lecture 6: Logistic Regression and Artificial Neural Networks
6.1 Logistic Regression
We can cast a binary classification problem into a continuous regression problem. However we can not simply use the
linear regression that we mentioned before. Logistic regression is used when the variable y that we want to predict
can only take on discrete values (i.e. Classification). Considering a binary classification problem (y = 0 or y = 1),
the hypothesis function could be defined so that it is bounded between [0, 1] in which we use some form of logistic
function, such as the sigmoid function. Other, more efficient functions exist, such as the ReLU (Rectified Linear
Unit); however, these are not covered in this course, as the sigmoid function is the historical standard.

6.1.1 Sigmoid Logistic Regression


One option is to use a sigmoid function. Why? Because it allows hΘ(x) to only take values between 0 and 1. This
means a smoother transition is made from false to true.
Sigmoid function:
• g(z) = 1 / (1 + e^{−z})

Now for the hypothesis:


• hΘ(x) = g(Θ^T x) = 1 / (1 + e^{−Θ^T x})

The decision boundary for the logistic sigmoid function is where hΘ (x) = 0.5 (values less than 0.5 means false,
values equal to or more than 0.5 means true). Another interesting property is that it also gives a chance of the instance
being of that class, e.g. hΘ(x) = 0.7 means that there is a 70% chance that the instance is of the corresponding class.
For example, with Θ = (−3, 1, 1) we get:
• hΘ(x) = g(Θ0 + Θ1 x1 + Θ2 x2) = g(−3 + x1 + x2), and we predict y = 1 if:

• −3 + x1 + x2 ≥ 0

6.1.2 Non-Linear Decision Boundaries


In the above cases of logistic regression we are speaking of a linear decision boundary (meaning we can draw
a straight line separating one class from the other instances). However, sometimes this is not the case. When dealing with
non-linear decision boundaries we use higher-order polynomials in order to be able to classify these cases, e.g. (with Θ = (−1, 0, 0, 1, 1)):

• hΘ (x) = g(Θ0 + Θ1 x1 + Θ2 x2 + Θ3 x21 + Θ4 x22 ) and we predict y=1 if:


• −1 + x21 + x22 ≥ 0

6.2 Cost Function


Given a new hypothesis, we now need a cost function. For linear regression we used the squared-error cost:
• J(Θ0, Θ1) = (1/2m) Σ_{i=1}^{m} Cost(hΘ(x^(i)), y^(i)), where
• Cost(hΘ(x), y) = (1/2)(hΘ(x) − y)^2
However, by plugging the sigmoid function into this cost, we end up with a non-convex cost function. This means that gradient descent can get stuck in local minima instead of finding the global minimum, which leads to slow or incorrect learning. Instead we use the logarithmic cost:
Cost(hΘ(x), y) = −log(hΘ(x)) if y = 1, and −log(1 − hΘ(x)) if y = 0
This means that the optimization objective function can be defined as the mean of the costs/errors in the
training set:
• J(Θ) = (1/m) Σ_{i=1}^{m} Cost(hΘ(x^(i)), y^(i))

6.3 Gradient Descent for Logistic Regression
How do we find the right Θ parameter value? We use gradient descent!
• Repeat until convergence:
1. Θj := Θj − α (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i)

NOTE: Simultaneously update all Θj !


Looks identical to linear regression, but with hΘ(x) = 1 / (1 + e^{−Θ^T x}). With regularization:
Repeat until convergence:
1. Θj := Θj − α ∂/∂Θj J(Θ), where:
(a) ∂/∂Θ0 J(Θ) = (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_0^(i)
(b) ∂/∂Θ1 J(Θ) = (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_1^(i) + (λ/m) Θ1
(c) ∂/∂Θ2 J(Θ) = (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_2^(i) + (λ/m) Θ2, and so on.
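A minimal NumPy sketch of (regularized) gradient descent for logistic regression, assuming X has a leading column of ones and y holds 0/1 labels (names and default values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, lam=0.0, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)               # predictions in (0, 1)
        grad = (X.T @ (h - y)) / m           # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]    # regularize every parameter except theta_0
        theta -= alpha * grad                # simultaneous update of all theta_j
    return theta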

6.4 Multi-Class Problems


Simply train k "one vs. all" classifiers, one per class. When predicting, pick the class with the highest probability (highest outcome
of hΘ).

6.5 Artificial Neural Networks


Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute
animal brains. Such systems learn (progressively improve performance on) tasks by considering examples, generally
without task-specific programming. An ANN is based on a collection of connected units or nodes called artificial
neurons (analogous to biological neurons in an animal brain). Each connection (synapse) between neurons can
transmit a signal from one to another. The receiving (postsynaptic) neuron can process the signal(s) and then signal
neurons connected to it.
In common ANN implementations, the synapse signal is a real number, and the output of each neuron is calculated
by a non-linear function of the sum of its inputs. Neurons and synapses typically have a weight that adjusts as learning
proceeds. The weight increases or decreases the strength of the signal that it sends across the synapse. Neurons may
have a threshold such that only if the aggregate signal crosses that threshold is the signal sent.
Typically, neurons are organized in layers. Different layers may perform different kinds of transformations on
their inputs. Signals travel from the first (input), to the last (output) layer, possibly after traversing the layers
multiple times. Alternative architectures include:
• Recurrent networks (give a memory effect, e.g. for counting, adding, etc.)
• Networks with multiple output units for multi-class problems

6.5.1 Forward Propagation


With Neural Networks, we’re trying to find a minimum of some certain function, where each neuron is connected to
all other neurons in the previous layer, where the weights in the weighted sum are acting like the strength of each of
those connections. The bias is some indication whether that specific neuron tends to be active or inactive.
• a_i^(j) = "activation" of unit i in layer j
• Θ^(j) = matrix of weights controlling the function mapping from layer j to layer j+1. It has dimension s_{j+1} × (s_j + 1), where s_j is the number of nodes in layer j.

so:
• a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2)
• a_2^(2) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2)
• hΘ(x) = g(Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2))
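A sketch of forward propagation for one hidden layer, matching the activations above (Theta1 and Theta2 are assumed weight matrices of shapes s_2 × (s_1 + 1) and 1 × (s_2 + 1); the bias units x_0 = a_0 = 1 are added explicitly):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))      # input layer with bias unit x_0 = 1
    a2 = sigmoid(Theta1 @ a1)            # hidden-layer activations a^(2)
    a2 = np.concatenate(([1.0], a2))     # add bias unit a_0^(2) = 1
    return sigmoid(Theta2 @ a2)          # output h_Theta(x)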

6.5.2 Learning The Weights


Back-propagation uses gradient descent, similar to linear and logistic regression. Where do we get errors for internal nodes?
It is given that d/dx g(x) = g(x)(1 − g(x)), and we can back-propagate as follows.
Algorithm for learning the weights:
Training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}
Set Δ_ij^(l) = 0 (for all l, i, j)
For i = 1 to m {
  Set a^(1) = x^(i)
  Perform forward propagation to compute a^(l) for l = 2, 3, ..., L
  Using y^(i), compute δ^(L) = a^(L) − y^(i)
  Compute δ^(L−1), δ^(L−2), ..., δ^(2)
  Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)
}
D_ij^(l) := (1/m) [Δ_ij^(l) + λ Θ_ij^(l)] if j ≠ 0
D_ij^(l) := (1/m) Δ_ij^(l) if j = 0
∂/∂Θ_ij^(l) J(Θ) = D_ij^(l)

6.5.3 Properties Of Neural Networks


• Useful for modelling complex, non-linear function of numerical inputs and outputs

– symbolic inputs/outputs represented using some encoding


– 2 or 3 layer networks can approximate a huge class of functions (if enough neurons in hidden layers)
• Robust to noise; but risk of over fitting (due to high expressiveness)! e.g. training for too long. Usually handled
using validation sets.

• All inputs have some effect: decision trees select the most important attributes, whereas an ANN "selects" attributes
by giving them higher or lower weights
• Explanatory power of ANNs is limited
– Model represented as weights in network
– No simple explanation why network makes a certain prediction (cf. trees can give a rule that was used)
– Networks can not easily be translated into a symbolic model (tree, ruleset)

Use ANNs when:


• High dimensional input and output (numeric or symbolic)
• Interpretability of model unimportant

7 Lecture 7: Recommender Systems
7.1 Collaborative Filtering
In short what this means is that we look at what other users/customers liked/rated and try to use this information
to recommend other products.

7.2 Content Based Approach


Given a list of films:

Movie                   Alice (Θ^(1))   Bob (Θ^(2))   Carol (Θ^(3))   User (4)
Love at last                 5               5              0              0
Romance Forever              5               ?              ?              0
Cute puppies of Love         ?               4              0              ?
Nonstop car chases           0               0              5              4
Swords vs. karate            0               0              5              ?

• Now the optimization criterion, to learn Θ^(j) (the parameter vector for user j):
– min_{Θ^(j)} (1/2) Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{k=1}^{n} (Θ_k^(j))^2
• Now in order to learn all parameters Θ^(1), Θ^(2), ..., Θ^(n_u):
– min_{Θ^(1),...,Θ^(n_u)} (1/2) Σ_{j=1}^{n_u} Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2
Note: the 2nd formula combines the knowledge from all users!
• So now we can update the gradient descent algorithm for this case:
– Θ_k^(j) := Θ_k^(j) − α Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j)) x_k^(i)   (for k = 0)
– Θ_k^(j) := Θ_k^(j) − α ( Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j)) x_k^(i) + λ Θ_k^(j) )   (for k ≠ 0)

7.3 Collaborative Filtering


Given x^(1), ..., x^(n_m), estimate Θ^(1), ..., Θ^(n_u):
• min_{Θ^(1),...,Θ^(n_u)} (1/2) Σ_{j=1}^{n_u} Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (Θ_k^(j))^2
Given Θ^(1), ..., Θ^(n_u), estimate x^(1), ..., x^(n_m):
• min_{x^(1),...,x^(n_m)} (1/2) Σ_{i=1}^{n_m} Σ_{j: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))^2
Estimating x^(1), ..., x^(n_m) and Θ^(1), ..., Θ^(n_u) simultaneously:
• J(x^(1), ..., x^(n_m), Θ^(1), ..., Θ^(n_u)) = (1/2) Σ_{(i,j): r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))^2 + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (Θ_k^(j))^2

7.3.1 Collaborative Filtering Algorithm
1. Initialize the input features x^(1), ..., x^(n_m) and weights Θ^(1), ..., Θ^(n_u) to small random values.
2. Minimize the cost function J(x^(1), ..., x^(n_m), Θ^(1), ..., Θ^(n_u)) using gradient descent (or another optimization algorithm).

3. For a user with (learned) parameter Θ and a movie with (learned) features x, predict a star rating of ΘT x.
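A sketch of the prediction step and the joint cost function (Y is the ratings matrix, R a 0/1 matrix marking which ratings exist, X and Theta the learned feature and parameter matrices; these names are illustrative assumptions):

import numpy as np

def predict_rating(theta_j, x_i):
    return theta_j @ x_i                   # predicted star rating Theta^T x

def cofi_cost(X, Theta, Y, R, lam):
    err = (X @ Theta.T - Y) * R            # only count entries where a rating exists
    return 0.5 * np.sum(err ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))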

7.3.2 Mean Normalization


Brand new users will receive a prediction of 0 (not very useful). In order to avoid this we can normalize the
mean. What we do is, we calculate the average rating, and we normalize by subtracting the average rating from the
set of ratings that each existing user has given so far. Then for user j, on movie i predict: (Θ(j) )T (x(i) ) + µi So if
there is no information from a user we give recommendations equal to the average rating!

7.4 Support Vector Machines


Usable in similar situations as neural networks. Important concepts:

• Finding a ”maximal margin” separation.


• Transformation into high dimensional space.

7.4.1 Linear SVMs


The idea is to find a hyperplane that discriminates + from - where the margin/distance of hyperplane to closest
points is maximal. The solution is unique and determined by just a few points (Support vectors).

7.4.2 Non-Linear SVMs


1. Transform the data to a higher-dimensional space where they are hopefully linearly separable.
2. Learn linear SVM in that space.
3. Transform linear SVM back to original space.

7.4.3 Logistic Regression to SVM


Alternative view on logistic regression:
• Cost of example: −(y log hΘ(x) + (1 − y) log(1 − hΘ(x))) = −y log(1 / (1 + e^{−Θ^T x})) − (1 − y) log(1 − 1 / (1 + e^{−Θ^T x}))

This can be done for similar reasons why we would use logistic regression in other classification cases.

7.4.4 Kernels
Pick data points in the space (named landmarks). Idea is that by applying a positive or negative weight to the
distance to a data point/kernel we can predict whether or not a new instance is a class:

• predict y = 1 if:
– Θ0 + Θ1 f1 + Θ2 f2 + ... + Θi fi ≥ 0
• given x:
– f_i = similarity(x, l^(i)) = exp(−||x − l^(i)||^2 / (2δ^2)), where l^(i) is kernel (landmark) i
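A one-line sketch of this Gaussian similarity feature (l is the landmark and delta the bandwidth, as in the formula above):

import numpy as np

def gaussian_kernel(x, l, delta):
    # f = exp(-||x - l||^2 / (2 * delta^2)): close to 1 near the landmark, near 0 far away
    return np.exp(-np.sum((x - l) ** 2) / (2 * delta ** 2))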

7.4.5 Cost Function
Hypothesis: Given x, compute features f ∈ R^{m+1}:
• predict "y = 1" if Θ^T f ≥ 0

Training:
• min_Θ C Σ_{i=1}^{m} [ y^(i) cost_1(Θ^T f^(i)) + (1 − y^(i)) cost_0(Θ^T f^(i)) ] + (1/2) Σ_{j=1}^{n} Θ_j^2

7.5 Compare SVM


It is interesting to compare SVMs with:
• Multi-layered Neural Networks:
– Perceptron: linear separation, not with maximal margin.
– ANN obtains better expressiveness by changing representation throughout its layers.
– SVM obtains better expressiveness through non-linear transformation.
• Instance Based Learning:
– SVM stores examples that identify boundary between classes; classification based on which side of the
boundary new example is.
– IBL: stores all examples; classification based on distance to stored examples.

8 Lecture 8:
8.1 Nearest Neighbor Algorithm
Idea: Instances that lie ”close” to each other are most likely similar to each other.

8.1.1 Properties
• Learning is very fast
• No info is lost (brings disadvantage: ”Details” may be noisy)
• Hypothesis space:

– Variable size
– Complexity of the hypothesis rises with the number of stored examples

8.1.2 Decision Boundaries


The sample space is basically ”cut” into pieces for all data points. These boundaries are not computed!
So in essence we keep all information. However this comes with the problem that more details means more noise.
In order to improve robustness against noisy learning examples we use a set of nearest neighbors. For classification
we use voting, and for regression we use the mean.
The method in the book contains a mistake, see the slide about the book.

8.1.3 Lazy vs Eager Learning


Lazy learning: Don’t do anything until we need to make a prediction (e.g. Nearest Neighbor)
• Learning is fast

• Predictions require work and can be slow


Eager learning: Start computing as soon as we receive data. (Decision tree, neural networks etc.)
• Learning can be slow
• predictions are usually fast!

8.1.4 Inductive vs Transductive learning


Induction: for input x find a model/function to calculate y.
• Computations take only learning data into account
• a single model must work well for all new data: global model
Transduction: for input x find some output y

• computations can take extra info about the needed predictions into account.
• Can use local models that work well in the neighborhood of the target example.

8.1.5 Semi-Supervised Learning


The learner gets a set of labeled data and a set of unlabeled data. Information about the probability distribution of
examples can help the learner. Given the little info on the slides, this is probably not important.

8.1.6 Distance Definition
The representation of the data is critical; it makes or breaks the NN algorithm.
For example, the Manhattan, Euclidean, or general L_n-norm distance for numerical attributes:
L_n(x1, x2) = ( Σ_{i=1}^{#dim} |x_{1,i} − x_{2,i}|^n )^{1/n}
Hamming distance for nominal attributes:
d(x, y) = Σ_{i=1}^{n} δ(x_i, y_i)
where δ(x_i, y_i) = 0 if x_i = y_i, and δ(x_i, y_i) = 1 if x_i ≠ y_i
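A small sketch of these two distance measures:

def ln_norm(x1, x2, n=2):
    # n = 1 gives the Manhattan distance, n = 2 the Euclidean distance
    return sum(abs(a - b) ** n for a, b in zip(x1, x2)) ** (1.0 / n)

def hamming(x, y):
    # number of nominal attributes on which the two instances differ
    return sum(1 for a, b in zip(x, y) if a != b)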

8.1.7 Normalization of Attributes


In order to avoid problems we normalize the attribute values. If we then want to capture the 5 nearest neighbors (say 0.1% of the data), we need:
• 1 dim: 0.1% of the range
• 2 dim: (0.1%)^{1/2} ≈ 3% of the range
• n dim: (0.1%)^{1/n} of the range
This is also called the curse of dimensionality.

8.1.8 Weighted Distances


Curse of Noisy Features: Big data sets with e.g. 10 dimensions already require almost 60% of the range. Therefore
irrelevant data destroy the metric’s meaningfulness.
But of course we have a solution for this: weighted distances!
d_w(x, y) = sqrt( Σ_{j=1}^{D} w_j (x_j − y_j)^2 )
Selecting attribute weights: we have several options:
• experimentally find out which weighs work well (cross-validation)
• Other solutions, e.g. Langley, 1996:
1. Normalize attributes (to scale 0-1)
2. Select weights according to ”average attribute similarity within class”

8.1.9 More distances


• Strings: Levenshtein distance/edit distance = minimal number of changes to change one word to the other.
Allowed edits: delete, insert, change.
• Euclidean: D(Q, C) ≡ sqrt( Σ_{i=1}^{n} (q_i − c_i)^2 ) (Pythagoras!)

• Sequence Distances:
– Dynamic Time Warping: Sequences are aligned ”one to one” (non linear alignments are possible)
– Dimensionality reduction

8.2 Distance-weighted kNN
Idea: give higher weight to closer instances; we can then even use all training instances instead of only k (aka "Shepard's method").
• f̂(x_q) = ( Σ_{i=1}^{k} w_i f(x_i) ) / ( Σ_{i=1}^{k} w_i ), with w_i = 1 / d(x_q, x_i)^2

This results in a fast learning algorithm but it has slow predictions. Efficiency:
• for each prediction, kNN needs to compute the distance for ALL stored examples.
• Prediction time = linear in the size of the data set, for large training sets and/or complex distances this can
be too slow to be practical.
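A sketch of distance-weighted kNN regression in the spirit of Shepard's method (data is assumed to be a list of (x, f(x)) pairs and dist a distance function such as the Euclidean one above):

def weighted_knn_predict(xq, data, dist, k=5):
    # take the k nearest stored examples and weight each by 1 / d^2
    neighbors = sorted(data, key=lambda ex: dist(xq, ex[0]))[:k]
    num = den = 0.0
    for x, fx in neighbors:
        d = dist(xq, x)
        if d == 0.0:
            return fx                      # exact match: return its stored value directly
        w = 1.0 / d ** 2
        num += w * fx
        den += w
    return num / den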

8.2.1 Edited k-nearest neighbor


• Less storage (good).
• Order dependent (bad).
• Sensitive to noisy data (bad).

• More advanced alternatives exist (= IB3).

The algorithm:
Incremental deletion of examples: Edited k-NN(S), where S is a set of instances
  For each instance x in S:
    if x is correctly classified by S \ x:
      remove x from S
  Return S

Incremental addition of examples: Edited k-NN(S), where S is a set of instances
  T = ∅
  For each instance x in S:
    if x is not correctly classified by T:
      add x to T
  Return T

8.3 Pipeline Filters


Pipeline filters: Reduce time spent on far-away examples by using more efficient distance-estimates first. We can
eliminate most examples using rough distance approximations and compute more precise distances for examples in
the neighborhood.

8.4 kD-trees
kD-trees: use a clever data structure to eliminate the need to compute all distances. kD-trees are similar to decision
trees except:
• Splits are made on the median/mean value of dimension with highest variance
• Each node stores one data point, leaves can be empty

Finds closest neighbor in logarithmic (depth of tree) time. However building a good kD-tree may take some time:
Learning time is no longer 0 and incremental learning is no longer trivial:
• kD-tree will no longer be balanced
• re-building the tree is recommended when the max depth becomes larger than 2 * the minimal required depth
(= log(N) with N training examples).

Using Prototypes: the rough decision surfaces of nearest neighbor can sometimes be considered a disadvantage. We
can solve two problems at once by using prototypes (= representative for a whole group of instances) For example
prototypes can be:
• single instances, replacing a group
• other structure, (e.g., rectangle/shape, rule, ..)
• Radial basis function networks: basically build a global approximation as a linear combination of local approximations: f(x) = w_0 + Σ_{u=1}^{k} w_u K_u(d(x_u, x)).
A common choice is K_u(d(x_u, x)) = exp(−d^2(x_u, x) / (2δ_u^2)). By using this, the influence of each local approximation u
goes down quickly with distance.

8.5 Local Learning


• Collect k nearest neighbors
• Give them a supervised algorithm
• Apply learned model to test example

Locally weighted regression: build a local model in the region around x_q (e.g. a linear or quadratic model), minimizing:
• Squared error for the k neighbors: E_1(x_q) ≡ Σ_{x ∈ kNN(x_q)} (f(x) − f̂(x))^2
• Distance-weighted squared error for all neighbors: E_2(x_q) ≡ Σ_{x ∈ D} (f(x) − f̂(x))^2 K(d(x_q, x))

8.6 Comments on k-NN


Positive
• Easy to implement
• Good ”baseline” algorithm / experimental control
• Incremental learning easy
• Psychologically plausible model of human memory
Negative
• Led astray by irrelevant features
• No insight into domain (no explicit model)
• Choice of distance function is problematic
• Doesn’t exploit/notice structure in examples

8.7 Decision Boundaries


Basically the learner tries to make a partition (with as few divisions as possible) of the instance space that indicates, for each partition, to what class it belongs.

8.8 Sequential Covering Approaches


Also known as the "Separate and Conquer" approach. General principle: learn a rule set one rule at a time. It tries
to learn one rule that has high accuracy (when it predicts something, it should be correct) and any coverage (it does
not need to make a prediction for all examples, just for some of them). Then mark the covered examples (these have been
taken care of; now focus on the rest). Repeat until all examples are covered.

8.8.1 Candidate Literals
There are two separate methods to determining candidate literals for these algorithms.

Top-Down Learn One Rule


For this algorithm, we simply go through all of the possible combinations of categories and their values, i.e. (wind =
weak), (wind = strong), (temp = mild), (temp = cool), (humidity = normal), (humidity = high) are all the possible
candidate literals for the above algorithm from the example in the homework assignment.

Top-down Example-driven Learn One Rule


For this algorithm, we want to find the literals that have the highest accuracy. First, we select an arbitrary example
e (usually starting with e1) and we find out which literal value has the highest accuracy. For example, if (humidity
= normal) has an accuracy of 3/4, we take that as our first literal. However, because the accuracy is not 100%, we must
find a second literal such that (hum = norm) AND (literal #2) has 100% accuracy.
In the example in the homework, (temp = mild) has 2/3 positive cases, of which 2/2 are covered when in
conjunction with (hum = norm). Therefore, the first rule is:
1. IF (hum = norm) AND (temp = mild)
which covers e1 and e2. However, this does not cover all positive cases: there still exists a third positive
example (e3). When (wind = weak) there are 2/2 positive examples, and (wind = weak)
covers the remaining positive example (e3), which leads to the conclusion that the second rule is:
2. IF (wind = weak)
and we are done.

8.8.2 Sequential covering


function LearnRuleSet(Target, Attrs, Examples, Threshold):
  LearnedRules := ∅
  Rule := LearnOneRule(Target, Attrs, Examples)
  while performance(Rule, Examples) > Threshold, do
    LearnedRules := LearnedRules ∪ {Rule}
    Examples := {Examples} \ {examples classified correctly by Rule}
    Rule := LearnOneRule(Target, Attrs, Examples)
  Optional: Sort learned rules according to performance
  return LearnedRules

Learning One Rule


• Perform greedy search

• Could be top-down or bottom-up


– Top-down:
∗ Start with the maximally general rule (has maximal coverage but low accuracy)
∗ Add literals one by one
∗ Gradually maximize accuracy without sacrificing coverage (using some heuristic)
Top down has typically more general rules
– Bottom-up:
∗ Start with maximally specific rule (has minimal coverage but maximal accuracy)
∗ Remove literals one by one
∗ Gradually maximize coverage without sacrificing accuracy (using some heuristic)
Bottom up has typically more specific rules

8.8.3 Heuristics
When is rule considered a good rule?
• High accuracy

• High coverage (less important than accuracy)


Possible evaluation functions:
• Accuracy: p / (p + n), where p = # positives and n = # negatives covered by the rule
• Variant on accuracy, the m-estimate: (p + mq) / (p + n + m), a weighted mean between the accuracy on the covered set of examples and
an a priori estimate of the true accuracy q (m is the weight).

• Entropy: more symmetry between positive and negative

8.9 Example-driven Top-down Rule induction


Idea: for a given class c:
As long as there are uncovered examples for C
• pick one such example e
• consider He = rules that cover this example
• search top-down in He to find best rule

Much more efficient search (H_e is much smaller than H, the set of all rules).
Less robust with respect to noise; a noisy example may require a restart.

8.10 Avoiding over-fitting


Post-pruning:
1. Split instances into Growing Set and Pruning Set
2. Learn set SR of rules using Growing Set

3. Find the best simplification BSR of SR


4. while (Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set)) do
(a) SR = BSR
(b) Find the best simplification BSR of SR

5. return BSR

9 Lecture 9: Clustering
9.1 Unsupervised Learning
The data just contains x; there is no given classification or other information. The main goal is to find structure in the
data. The definition of ground truth is often missing (there is no clear error function like in supervised learning).

9.2 Clustering
Problem definition:
Let X = (x_1, x_2, ..., x_d) be a d-dimensional feature vector.
Let D be a set of such vectors, D = {X_1, X_2, ..., X_N}. Given data D, group the N vectors into K groups such that the
grouping is optimal.
Clustering is used for:
• Establish prototypes or detect outliers

• Simplify data for further analysis/learning


• Visualize data
• Preprocessing step for algorithms

• stand alone tool to get insight into data distribution


A good clustering method will produce clusters with
• High intra-class similarity
• Low inter-class similarity

• precise definition of clustering quality is difficult (application-dependent and ultimately subjective)

9.3 Similarity Measures


Possible options

• Distance Metric (Ln metric, ...)


• More general forms of similarity (Do not necessarily satisfy triangle inequality, symmetry, ...)

9.4 Flat vs. Hierarchical Clustering


Flat clustering: given a data set, return a partition. Hierarchical clustering:
• Combine clusters into larger clusters, etc. until 1 cluster = full data set
• Gives rise to cluster hierarchy or taxonomy (taxonomy = grouping of classes; e.g. mammals - Felines - Tigers
etc.)

9.5 Extensional vs Intensional Clustering


Extensional clustering: clusters are defined as sets of examples. Intensional clustering: clusters are described in
some language. Typical criteria for a good intensional clustering:

• High intra cluster similarity


• Simple conceptual description of clusters.

9.6 Cluster Assignment
• Hard clustering: Each item is a member of one cluster
• Soft Clustering: Each item has a probability of membership in each cluster
• Disjunctive clustering: An item belongs to only one cluster
• Non-disjunctive (overlapping) clustering: An item can be in more than one cluster
• Exhaustive clustering: Each item is a member of a cluster
• Partial Clustering: Some items do not belong to a cluster (in practice this is equal to exhaustive clustering
with singleton clusters)

9.7 Major Clustering Approaches


• Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
• Partitioning: Construct various partitions and then evaluate them by some criterion
• Model-based: Hypothesize a model for each cluster and find the best fit of the models to the data
• Density based: Guided by connectivity and density functions

9.8 Hierarchical Clustering


Can do top-down (divisive) or bottom-up (agglomerative). In either case we maintain a matrix of distance (or
similarity) scores for all pairs of instances, clusters (formed so far) or both.

9.8.1 Dendrogram
Tree view of hierarchical clusters; the higher the top bar (horizontal line), the higher the degree of difference within the merged
cluster.

9.8.2 Bottom up Hierarchical Clustering


Given: instances x_1, ..., x_n
for (i = 1 to n): c_i = {x_i}
C = {c_1, ..., c_n}
j = n
while size of C > 1:
  j = j + 1
  (c_a, c_b) = argmin_{u,v} dist(c_u, c_v)
  c_j = c_a ∪ c_b
  add node to tree joining a and b
  C = C \ {c_a, c_b} ∪ {c_j}
Return tree with root node j

9.9 Distance between two clusters


The distance between two clusters can be determined in several ways
• Single link: Distance of two most similar instances: dist(cu , cv ) = min{dist(a, b) | a ∈ cu , b ∈ cv }
• Complete link: distance of 2 least similar instances: dist(cu , cv ) = max{dist(a, b) | a ∈ cu , b ∈ cv }
• Average link: average distance between instances: dist(cu , cv ) = avg{dist(a, b) | a ∈ cu , b ∈ cv }
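A small sketch of these three inter-cluster distances (clusters given as lists of instances, dist a pairwise distance function):

def single_link(cu, cv, dist):
    return min(dist(a, b) for a in cu for b in cv)

def complete_link(cu, cv, dist):
    return max(dist(a, b) for a in cu for b in cv)

def average_link(cu, cv, dist):
    return sum(dist(a, b) for a in cu for b in cv) / (len(cu) * len(cv))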
Computational complexity: Naive implementation has O(n3 ) time complexity, where n is the number of instances.
More advanced computations:
• Single link: Can update and pick pair in O(n), which results in O(n2 ) algorithm
• Complete and average link: Can do these steps in O(n log n), which yields an O(n2 logn) algorithm.

10 Lecture 10:
10.1 Reinforcement learning
Reinforcement learning stems from the situation where an agent only receives a reward after a sequence/series of
actions has been performed. It stems from biological and societal systems where an agent is given a reward (e.g.
dopamine) based on previous decisions, instead of being given constant guidance about which decision is correct or
incorrect.
In reinforcement learning, the agent typically does not possess full knowledge of the environment or the result of
each action. More formally:

• Given:
1. a Set of States S (known to the agent only after exploration)
2. a Set of Actions A (per state)
3. a Transition function: s_{t+1} = δ(s_t, a_t) (unknown to the agent), where δ represents the state transition
4. a Reward function: rt = r(st , at ) (unknown to agent)
• Find:

1. Policy π : S → A that outputs an appropriate action a from set A, given the current state s from set S
such that π(st ) = at .

10.2 Optimal Policy


The optimal policy π is found by maximizing the cumulative value/reward:

• V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ≡ Σ_{i=0}^{∞} γ^i r_{t+i}

where gamma 0 ≤ γ ≤ 1 is a ”discount factor” that leads us to prefer either immediate reward or delayed reward
(higher values of γ → later reward preference). Therefore, the optimal policy becomes:

• π* ≡ argmax_π V^π(s), (∀s), where the value function of the optimal policy for state s is written:

• V^{π*}(s) or V*(s)

However, this demonstrates a problem. How can we learn the optimal policy π* : S → A for arbitrary environments?
Since training data of the form ⟨s, a⟩ is not available, π* cannot be learned directly, because the agent can only directly
choose a and not s. This leads us to the concept of Q-Learning.

10.3 Q-learning Algorithm


Q-learning does not require a model aka it is model-free. It is also exploration-independent (off-policy).

10.3.1 Q-Learning Intuition


We want to maximize the sum of the rewards, doing maximization iteratively while exploring state-action pairs (s,
a) to explore the cumulative reward:

• π*(s) ≡ argmax_a [r(s, a) + γ V*(δ(s, a))]

The problem with this is that the agent typically does not have perfect knowledge of δ (the state transitions) or
r (the reward in all states). This means that the agent cannot predict the reward and the immediate successor state,
so V* cannot be learned directly. Solution: learn the Q-values instead, by computing the optimal
Q-values for all state-action pairs using the Bellman equation:

• Q(s, a) ← r + γ max_{a'} Q(s', a'), so the optimal policy becomes:

• π*(s) ≡ argmax_a Q(s, a)

10.3.2 Learning the Q-Values
We use iterative approximation to learn the Q-values for a given state-action pair:
• V*(s) = max_{a'} Q(s, a')
So that we can rewrite:
• Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
And we then obtain the recursive update rule that allows an iterative approximation of Q:
• Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
This way, the agent stores the value Q̂(s, a) in a large look-up table. Then the agent repeatedly observes its own
current state s, chooses some action a, and observes the resulting reward r(s, a) and the new state s0 = δ(s, a). This
way, the agent repeatedly samples from unknown functions δ(s, a) and r(s, a) without having full knowledge of these
functions.
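A sketch of one tabular Q-learning step, with Q as a look-up table (a dict keyed by (state, action) pairs); the learning rate alpha is a generalization, and with alpha = 1 it reduces to the deterministic update rule above:

def q_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=1.0):
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # actions valid in s_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)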

10.3.3 Q-Learning Optimality


In deterministic environments (if the next state is perfectly predictable given knowledge of the previous state and the
agent’s action), Q-Learning is guaranteed to converge for infinite amounts of updates of each state-action
pair. In practice, infinite amounts of updates are not required to determine the optimal policy.

10.3.4 Accelerating the Q-Learning Process


One way to accelerate this process is to back-propagate the Q-values after a visit of a sequence of states. For this
you have to remember previously visited states within one run.
In this case, do we always choose the next action that maximizes Q̂(s, a)? NO, because this risks the situation where no
new values are learned and it can become biased by initial random exploration, meaning that Q-Learning would
not converge.
The better choice is to balance exploration with the exploitation of known Q-values. A probabilistic model:
• P(a_i | s) = k^{Q̂(s, a_i)} / Σ_j k^{Q̂(s, a_j)}

Actions with higher Q̂(s, a) are more likely to be picked compared to other actions. High k = higher exploitation
factor, lower k = higher exploration factor.

10.3.5 Q-Learning Summary


• Q-Learning is model-free: Q-Learning does not need any information about the environment except for the set
of valid actions for each state.
• Given a chosen state-action pair, the environment will provide the rewards.
• Once these are given, a reinforcement learning technique such as Q-Learning explores the environment and the
connected reward autonomously and thus performs autonomous learning of the optimal policy.
• Q-Learning is only guaranteed to converge given infinitely many iterations; in practice, however, it converges in a reasonable
number of iterations.

10.4 Online Learning and SARSA


An off-policy learner learns the value of the optimal policy independently of the agent’s actions. An on-policy
learner learns the value of the policy being carried out by the agent, including the exploration steps.
Limitation of off-policy learning: There may be cases where ignoring what the agent actually does is dangerous (there
will be large negative rewards).
SARSA updates using the action that was actually chosen by the agent (rather than the best possible action argmax_a Q(s, a)).
• Can take exploration into account
• Online and continuous learning

10.5 Expectation Maximization
Given a statistical model which generates a set X of observed data, a set of unobserved latent data or missing
values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the
maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the
observed data:

• L(θ; X) = p(X | θ) = ∫ p(X, Z | θ) dZ

However, this quantity is often intractable (e.g. if z is a sequence of events, so that the number of values grows
exponentially with the sequence length, making the exact calculation of the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

• Expectation step (E step):


– Calculate the expected value of the log likelihood function, with respect to the conditional distribution of
Z given X under the current estimate of the parameters θ^(t):
– Q(θ | θ^(t)) = E_{Z|X,θ^(t)} [log L(θ; X, Z)]
• Maximization step (M step): Find the parameters that maximize this quantity:

– θ^(t+1) = argmax_θ Q(θ | θ^(t))

The typical models to which EM is applied uses Z as a latent variable indicating membership in one of a set of
groups:
The observed data points x may be discrete (taking values in a finite or countably infinite set) or continuous
(taking values in an uncountably infinite set). Associated with each data point may be a vector of observations. The
missing values (aka latent variables) Z are discrete, drawn from a fixed number of values, and with one latent variable
per observed unit. The parameters are continuous, and are of two kinds: Parameters that are associated with all
data points, and those associated with a specific value of a latent variable (i.e., associated with all data points which
corresponding latent variable has that value). However, it is possible to apply EM to other sorts of models.

