
Machine Learning Summary 2017

Pieter Schaap (2014), updated by Andrew Gold (2017)


March 13, 2018

Contents

1 Lecture 1: Version Spaces
1.1 Classification Task
1.2 Learning Classifiers
1.3 Conjunction of Discrete Attributes
1.4 FindS Algorithm
1.5 Version Spaces
1.6 List elimination Algorithm
1.7 Boundary Sets
1.8 Candidate Elimination Algorithm
1.8.1 Picking training instances
1.8.2 Unanimous-Voting rule
1.8.3 Inductive Bias
1.8.4 Unanimous Voting
1.8.5 Accuracy
1.9 Volume Extension Approach
1.9.1 In Practice
1.10 K-Version Spaces

2 Lecture 2: Decision Trees
2.1 Decision Trees for Classification
2.1.1 Now when can we use decision trees?
2.1.2 Decision Tree Learning
2.1.3 Entropy
2.1.4 Information Gain
2.2 ID3 Algorithm
2.2.1 Hypothesis Space
2.2.2 Inductive Bias in ID3
2.3 Overfitting, Underfitting, and Pruning
2.3.1 Causes of Overfitting
2.3.2 Avoiding Overfitting
2.3.3 Underfitting
2.3.4 Identifying Overfitness, Underfitness, and Optimality
2.3.5 Growing Set vs Validation Set
2.3.6 Reduced-Error Pruning
2.3.7 Rule Post-Pruning
2.3.8 Impurity
2.3.9 Reduction of impurity
2.3.10 Gini Index
2.4 Dealing with continuous attributes
2.5 Oblique Decision Trees
2.6 Attributes with Many Values
2.6.1 Gain Ratio
2.7 Missing Attribute Values
2.8 Windowing

3 Lecture 3: Evaluation of Learning Models
3.1 Motivation
3.2 Evaluation of Classifier Performance
3.2.1 Confusion Matrix
3.2.2 Metrics
3.2.3 Confidence Intervals for Estimates on Classification Performance
3.2.4 Metric Evaluation TL;DR
3.3 Comparing Data-Mining Classifiers
3.3.1 Counting the Costs
3.3.2 Cost-Sensitive Classification
3.4 Lift Charts
3.4.1 Generating a Lift Chart
3.5 ROC Curves
3.5.1 ROC Convex Hull
3.5.2 Iso-Accuracy Lines
3.5.3 Constructing ROC Curve for 1 Classifier
3.5.4 Area Under Curve Metric (AUC)

4 Lecture 4: Bayesian Learning
4.1 Introduction
4.2 Bayes Theorem
4.3 Maximum a Posteriori Hypothesis (MAP)
4.4 Useful Formulas
4.5 Brute Force MAP hypothesis learner
4.6 Minimum Description Length Principle
4.7 Bayes Optimal Classifier
4.8 Gibbs Classifier
4.9 Naïve Bayes Classifier

5 Lecture 5: Linear Regression
5.1 Supervised Learning: Regression
5.1.1 Regression versus Classification
5.2 Linear Regression
5.3 Cost function intuition
5.3.1 Least Squares Error
5.4 Gradient descent
5.4.1 Choosing Learning Rate
5.4.2 Multiple Features
5.5 Normal Equation
5.5.1 Feature Scaling
5.5.2 The Algorithm
5.6 Normal Equation vs Gradient Descent
5.7 Finding the "right" model
5.7.1 Regularization

6 Lecture 6: Logistic Regression and Artificial Neural Networks
6.1 Logistic Regression
6.1.1 Sigmoid Logistic Regression
6.1.2 Non-Linear Decision Boundaries
6.2 Cost Function
6.3 Gradient Descent for Logistic Regression
6.4 Multi-Class Problems
6.5 Artificial Neural Networks
6.5.1 Forward Propagation
6.5.2 Learning The Weights
6.5.3 Properties Of Neural Networks

7 Lecture 7: Recommender Systems
7.1 Collaborative Filtering
7.2 Content Based Approach
7.3 Collaborative Filtering
7.3.1 Collaborative Filtering Algorithm
7.3.2 Mean Normalization
7.4 Support Vector Machines
7.4.1 Linear SVMs
7.4.2 Non-Linear SVMs
7.4.3 Logistic Regression to SVM
7.4.4 Kernels
7.4.5 Cost Function
7.5 Compare SVM

8 Lecture 8:
8.1 Nearest Neighbor Algorithm
8.1.1 Properties
8.1.2 Decision Boundaries
8.1.3 Lazy vs Eager Learning
8.1.4 Inductive vs Transductive learning
8.1.5 Semi-Supervised Learning
8.1.6 Distance Definition
8.1.7 Normalization of Attributes
8.1.8 Weighted Distances
8.1.9 More distances
8.2 Distance-weighted kNN
8.2.1 Edited k-nearest neighbor
8.3 Pipeline Filters
8.4 kD-trees
8.5 Local Learning
8.6 Comments on k-NN
8.7 Decision Boundaries
8.8 Sequential Covering Approaches
8.8.1 Candidate Literals
8.8.2 Sequential covering
8.8.3 Heuristics
8.9 Example-driven Top-down Rule induction
8.10 Avoiding over-fitting

9 Lecture 9: Clustering
9.1 Unsupervised Learning
9.2 Clustering
9.3 Similarity Measures
9.4 Flat vs. Hierarchical Clustering
9.5 Extensional vs Intensional Clustering
9.6 Cluster Assignment
9.7 Major Clustering Approaches
9.8 Hierarchical Clustering
9.8.1 Dendrogram
9.8.2 Bottom up Hierarchical Clustering
9.9 Distance between two clusters

10 Lecture 10:
10.1 Reinforcement learning
10.2 Optimal Policy
10.3 Q-learning Algorithm
10.3.1 Q-Learning Intuition
10.3.2 Learning the Q-Values
10.3.3 Q-Learning Optimality
10.3.4 Accelerating the Q-Learning Process
10.3.5 Q-Learning Summary
10.4 Online Learning and SARSA
10.5 Expectation Maximization
1 Lecture 1: Version Spaces
Version space learning is a logical approach to machine learning, specifically binary classification. Version space
learning algorithms search a predefined space of hypotheses, viewed as a set of logical sentences. Formally, the
hypothesis space is a disjunction:
• H1 ∨ H2 ∨ ... ∨ Hn

(i.e., either hypothesis 1 is true, or hypothesis 2, or any subset of the hypotheses 1 through n). A version space
learning algorithm is presented with examples, which it will use to restrict its hypothesis space; for each example x,
the hypotheses that are inconsistent with x are removed from the space. This iterative refining of the hypothesis space
is called the candidate elimination algorithm (see 1.8); the hypothesis space maintained inside the algorithm is called its
version space.

Overview
• Classification Task
• FindS algorithm

• Version Spaces
• List Elimination Algorithm
• Boundary Sets and Candidate Elimination Algorithm
• Properties of Version Spaces

• Inductive Bias
• Version Spaces and Consistency Tests
• Volume Extension and k-Version Spaces

1.1 Classification Task


• A class is a set of objects with the same appearance, structure, or function.
• Elements are aspects of one (or more) objects.

• Classifiers are a set of elements that indicate that an object belongs to a certain class.
• The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be
returned by it (as being true).

So a classification task consists of four components: X, Y, H, and D, where:
• X := the instance space (the set of objects to be classified)
• Y := the set of classes; a hypothesis in H maps an object in X to a class in Y
• H := the hypothesis space
• D := the training data
Binary classification task: |Y| = 2. Multi-class classification task: |Y| > 2.

1.2 Learning Classifiers


Essentially a search in the hypothesis space where the goal is to find a hypothesis that best fits the training data D. If this hypothesis is consistent with a sufficiently large set of training data, it will give a good approximation of other unobserved instances.
Consistency criterion: hypothesis h is consistent with D ⇔ h(x) = y for each instance (x, y) in D.
When ordering hypotheses from "general" to "specific" the following applies (h1 is more general than or equal to h2 iff every instance that h2 classifies as positive is also classified as positive by h1):
(∀h1, h2 ∈ H)((h1 ≥ h2) ⇔ (∀x ∈ X)(h2(x) = 1 ⇒ h1(x) = 1))

1.3 Conjunction of Discrete Attributes
How to generalize a hypothesis (h) with respect to an instance (x)?
For every attribute Ai in the hypothesis h where Ai is specified and contradicts the instance x: Set Ai of h to ?
(unspecified).
And how do we make it more specific? First we create an empty set that we call the specializations. Assuming
that the instance x is a positive object;
For every attribute value v of Ai of h that is not specified (=?) we create a specialization s that is equal to h and
set the value of attribute Ai of s to v. We then set the specializations set to be the union of itself and s. (end for)

1.4 FindS Algorithm


Initialize s to the most specific hypothesis in H.
For every training instance x, check whether x is positive; if so, generalize s against x (as in Section 1.3). If x is negative, check whether s(x) = 1 (i.e. s classifies it as positive); if that is the case, stop.
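A minimal sketch of this procedure in Python, assuming instances are tuples of discrete attribute values and a conjunctive hypothesis is a list in which "?" matches anything (the toy weather data is made up for the example):

def matches(h, x):
    # A conjunctive hypothesis matches an instance if every specified attribute agrees ("?" matches anything).
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def find_s(examples):
    # examples: list of (x, y) pairs; x is a tuple of attribute values, y is True/False.
    s = None  # the most specific hypothesis: covers nothing at all
    for x, y in examples:
        if y:
            # Positive instance: minimally generalize s so that it covers x.
            s = list(x) if s is None else [hv if hv == xv else "?" for hv, xv in zip(s, x)]
        elif s is not None and matches(s, x):
            break  # a covered negative instance: the data cannot be fit by a single conjunction
    return s

data = [(("sunny", "warm"), True), (("rainy", "cold"), False), (("sunny", "hot"), True)]
print(find_s(data))  # ['sunny', '?']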

1.5 Version Spaces


Definition: The version space VS(D) for the training data D is the set of all the consistent hypotheses in H. or in
mathematical notation:
V S(D) = {h ∈ H|consistent(h, D)}
The classification rule of a version space is the unanimous-voting rule, i.e. all hypotheses in VS(D) must agree on the class of an object x in order for x to be classified; otherwise x is left unclassified.

1.6 List elimination Algorithm


More commonly known as the "List-then-eliminate algorithm". It takes a list of all the hypotheses in H and, for every training instance, removes every hypothesis from the list that is not consistent with that instance.
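A minimal sketch of the list-then-eliminate idea, assuming each hypothesis is given as a callable h(x) returning a label (the tiny threshold hypothesis space is made up for the example):

def list_then_eliminate(hypotheses, examples):
    # Keep only the hypotheses consistent with every (x, y) example: the version space.
    version_space = list(hypotheses)
    for x, y in examples:
        version_space = [h for h in version_space if h(x) == y]
    return version_space

# Usage: a tiny hypothesis space of threshold classifiers on one numeric attribute.
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
examples = [(1, False), (3, True)]
survivors = list_then_eliminate(hypotheses, examples)
print(len(survivors))  # 2 hypotheses (thresholds 2 and 3) remain consistent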

1.7 Boundary Sets


Two types:

• Minimal Boundary Set (Most specific Set)


• Maximal Boundary Set (Most General Set)
In essence this means that if a hypothesis h in H is consistent with (fits) the training data D, then there exists a hypothesis s in the minimal boundary set and a hypothesis g in the maximal boundary set such that s ≤ h ≤ g,
or more formally: (∀h ∈ H)((h ∈ V S(D)) ⇐⇒ (∃s ∈ S(D))(∃g ∈ G(D))(s ≤ h ≤ g)).

1.8 Candidate Elimination Algorithm


The candidate elimination algorithm incrementally builds the version space given a hypothesis space H and a set E
of examples. The examples are added one by one; each example possibly shrinks the version space by removing the
hypotheses that are inconsistent with the example. The candidate elimination algorithm does this by updating the
general and specific boundary for each new example.

• Candidate Elimination Algorithm(X,Y,E,H)


– Inputs:
∗ X: set of input features, X=X1,...,Xn
∗ Y: target feature
∗ E: set of examples from which to learn
∗ H: hypothesis space
– Output:
∗ general boundary GH

∗ specific boundary SH consistent with E
– Local
∗ G: set of hypotheses in H
∗ S: set of hypotheses in H
– Let G={true}, S={false};
1. for each e ∈ E do:
(a) if (e is a positive example) then compare e to Gi−1 (of the previous example).
i. Elements of G that classify e as negative are removed from G;
ii. Each element g in Gi−1 that contradicts with the same element in example e is removed from
the new general set G for example e.
iii. Non-maximal hypotheses are removed from S;
(b) else if (e is a negative example) then compare e to S of previous example:
i. Elements of S that classify e as positive are removed from S;
ii. Each element s of Si−1 that contradicts with the same element in the negative example e goes
into a new general set G where the contradicting element is the only specific element, and all
other elements are marked with a ?. If there are multiple elements e that contradict with the
same element in S, a new general set G is made. All contradictions get their own set G with
only ?’s and the single contradicting element.
∗ Each new general set is bound to the specific contradiction of the previous S.
∗ Then we eliminate from the new S (belonging to ei ) the negative elements in e that align
with the specific set S from the previous example.
iii. Non-minimal hypotheses are removed from G.

More elaborate explanation: http://artint.info/html/ArtInt_193.html

The candidate elimination algorithm converges to a correct description if:

• there are no errors in the training data, and

• the hypothesis space H contains the classifier of the target class.

1.8.1 Picking training instances


When picking the next training instance the learner should request instances that correspond to exactly half of the
descriptions in the Version Space. Therefore the description of the target concept can be found with log2 |V S| number
of instances.

1.8.2 Unanimous-Voting rule


• Definition 1: This basically means that both (upper and lower) boundaries should agree on whether a training
instance is true or false and do not contradict the training instance. (true if the training instance is true, false
if the training instance is false).
• Definition 2: Given version space VS(D), an instance x ∈ X receives a classification VS(D)(x) defined as follows:

V S(D)(x) = y if V S(D) ≠ ∅ ∧ (∀h ∈ V S(D)) y = h(x), and V S(D)(x) = "?" otherwise.

• Definition 3: Volume V(VS(D)) of version space VS(D) is the set of all instances that are not classified by
VS(D).

1.8.3 Inductive Bias
Completeness of a version space: a version space is complete ↔ for any dataset D there exists a hypothesis h in H s.t. h is consistent with D.
Now the inductive bias of version spaces is the assumption that a version space is incomplete! So when do we speak of a correct inductive bias? That is the case when the target hypothesis t is in the hypothesis space H and the training data are noise free (all fields are known and correct). According to the internet: inductive bias = the assumption that the target concept is contained in the hypothesis space (doesn't this contradict the slides' statement above?).
However, upon reviewing this with someone we concluded that it is possible that the inductive bias is simply the set of rules that we found from inductive learning over the training data, which can then be used to classify new instances. WE THINK!

1.8.4 Unanimous Voting


Theorem: For any instance x ∈ X and class y ∈ Y :
(∀h ∈ V S(D))(h(x) = y) ↔ (∀y′ ∈ Y \ {y}) V S(D ∪ {(x, y′)}) = ∅.
In other words: every hypothesis in the version space classifies instance x as class y if and only if adding x with any other label y′ collapses the version space (makes it empty). The theorem states that the unanimous-voting rule can be implemented if we have an algorithm to test version spaces for collapse.
Unanimous voting can be used to determine whether we can correctly classify an instance.

1.8.5 Accuracy
So when can we reach 100% accuracy and when not? Well there are 3 cases:
• Case 1: Data is noise free and the hypothesis space H contains the target classifier. (100% accuracy)

• Case 2: The hypothesis space H does not contain the target classifier and thus we do not know for sure which
class the instance has.
• Case 3: The training data contains noise. Therefore we cannot be certain if we are classifying correctly.

1.9 Volume Extension Approach


The volume-extension approach is a new approach to overcome the problems with noisy training data and inexpressive hypothesis spaces. If a version space V S(I+, I−) ⊆ H misclassifies instances, the approach is to find a new hypothesis space H′ s.t. the volume of version space V S′(I+, I−) ⊆ H′ grows and blocks instance misclassifications.
Theorem: Consider hypothesis spaces H and H′ such that:
(∀D)((∃h ∈ H) consistent(h, D) ⇒ (∃h′ ∈ H′) consistent(h′, D)). Then, for any data set D: V (V S(D)) ⊆ V (V S′(D)).

1.9.1 In Practice
• Case 2: H does not contain the target classifier. The solution in this case is to add a classifier that classifies
the instance differently than the classifiers in VS(D). In other words, we extend the volume of VS(D)
• Case 3: When the datasets are noisy. The solution is again to add a classifier that classifies the instances
differently than the classifiers in VS(D) and we extend the volume of VS(D) again.

1.10 K-Version Spaces


k-Version spaces were introduced to handle noisy data. They are defined as sets of k-consistent hypotheses, i.e. hypotheses consistent with all but at most k instances. Definition 1: Given a classifier space H and training data D, the k-version space V Sk(D) is:
V Sk(D) = {h ∈ H | consistentk(h, D)},
where
consistentk(h, D) ↔ (∃Dk ⊆ D, |D \ Dk| ≤ k)(∀(x, y) ∈ Dk)(y = h(x))
Theorem: if k2 > k1 then, for any data set D:
V (V Sk1(D)) ⊆ V (V Sk2(D))

2 Lecture 2: Decision Trees
Overview
Decision Trees for Classification

• Definition
• Classification Problems for Decision Trees
• Entropy and Information Gain

• Learning Decision Trees


• Overfitting, Underfitting, and Pruning
• Validation Set vs Growing Set
• Handling Continuous-Valued Attributes

• Handling Missing Attribute Values


• Alternative Measures for Selecting Attributes
• Handling Large Data: Windowing

2.1 Decision Trees for Classification


Definition: A decision tree is a tree where:
• Each interior node tests an attribute of some data set

• Each branch corresponds to an attribute value


• Each leaf node is labeled with a class (class node) of the data

2.1.1 Now when can we use decision trees?


Each instance is described by attributes with discrete values, e.g. weather forecast = sunny or weather forecast = rainy. The classification has to happen over discrete values (true or false; yes or no; 0 or 1, etc.). Decision trees can represent disjunctive descriptions: each path in the tree is a conjunction of attribute tests, and the tree as a whole is a disjunction of these paths. If the training set contains errors or missing data, the decision tree is robust enough to deal with this.

2.1.2 Decision Tree Learning


There is a basic algorithm for learning a decision tree:
1. A ← the ”best” decision attribute for a node N.
2. Assign A as decision attribute for the node N.

3. For each value of A, create new descendant of the node N.


4. Sort training examples to leaf nodes.
5. IF training examples perfectly classified, THEN STOP.
ELSE iterate over new leaf nodes

So in short, for the chosen decision attribute (e.g. weather forecast) it creates a child node per attribute value and sorts all training instances to the matching child nodes. If everything is classified correctly (no leaf node contains both a true and a false instance at the same time), it stops; otherwise it repeats the process on the impure leaf nodes.

2.1.3 Entropy
Entropy measures the impurity of the training data:
• E(S) = −p+ log2 p+ − p− log2 p−

where S is a sample of the training data, p+ is the proportion of positive training instances and p− the proportion of negative ones. This brings us to information gain:

2.1.4 Information Gain


Information gain is the expected reduction in entropy if a certain attribute A is selected to generate the new leaf nodes. One can compute the information gain using the following formula:
• Gain(S, A) = E(S) − Σv∈Values(A) (|Sv| / |S|) · E(Sv)

where Sv = {s ∈ S | A(s) = v}, i.e. the set of all samples s in S for which attribute A takes the value v.
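As an illustration, a minimal Python sketch of the entropy and information-gain formulas above; the toy forecast data is made up for the example:

from collections import Counter
from math import log2

def entropy(labels):
    # E(S) = -sum over classes of p * log2(p), computed from the class proportions in the sample.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain(S, A) = E(S) - sum over values v of |Sv|/|S| * E(Sv).
    n = len(labels)
    gain = entropy(labels)
    for v in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["yes", "yes", "no", "yes"]
print(round(entropy(labels), 3))                    # 0.811
print(round(information_gain(rows, labels, 0), 3))  # 0.311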

2.2 ID3 Algorithm


In informal terms, the ID3 Algorithm does:

• Determine the attribute with the highest information gain on the training set.
• Use this attribute as the root, create a branch for each of the values the attribute can have.
• For each branch repeat the process with subset of the training set that is classified by that branch.
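A compact recursive sketch of this greedy loop (split on the attribute with the highest information gain, recurse on each branch); it is only an illustration with simplified tie-breaking and no pruning, and the toy data is made up:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    # Leaves are class labels; interior nodes are {attribute_index: {value: subtree}}.
    if len(set(labels)) == 1:
        return labels[0]                             # pure node: stop
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority class

    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            subset = [y for r, y in zip(rows, labels) if r[a] == v]
            g -= len(subset) / len(labels) * entropy(subset)
        return g

    best = max(attributes, key=gain)                 # attribute with the highest information gain
    children = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
        children[v] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return {best: children}

rows = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "yes"]
print(id3(rows, labels, [0, 1]))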

2.2.1 Hypothesis Space


Hypothesis space = set of all decision trees defined over given set of attributes. ID3 guarantees a complete hypothesis
space; meaning that the target description is in the hypothesis space. it basically does a simple-to-complex hill
climbing search through this space where the evaluation function is the information gain. It only expands over 1
current decision tree; meaning that it only expands over 1 node in the previous decision tree and does not backtrack.
Note that ID3 uses the entire dataset at each step of the search.

2.2.2 Inductive Bias in ID3


(From Wikipedia) The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions
that the learner uses to predict outputs given inputs that it has not encountered. In machine learning,
one aims to construct algorithms that are able to learn to predict a certain target output. To achieve this, the
learning algorithm is presented some training examples that demonstrate the intended relation of input and output
values. Then the learner is supposed to approximate the correct output, even for examples that have not been shown
during training. Without any additional assumptions, this problem cannot be solved exactly since unseen situations
might have an arbitrary output value. The kind of necessary assumptions about the nature of the target function
are subsumed in the phrase inductive bias.
A classical example of an inductive bias is Occam’s Razor, assuming that the simplest consistent hypothesis
about the target function is actually the best. Here consistent means that the hypothesis of the learner yields correct
outputs for all of the examples that have been given to the algorithm.
Approaches to a more formal definition of inductive bias are based on mathematical logic. Here, the inductive bias
is a logical formula that, together with the training data, logically entails the hypothesis generated by the learner.
Unfortunately, this strict formalism fails in many practical cases, where the inductive bias can only be given as a
rough description (e.g. in the case of neural networks), or not at all.
When ”choosing” the inductive bias from the ID3 search we have some preferences on picking it.
• we prefer short trees

• we prefer trees with high information gain attributes near the root.
Note that the bias is not a restriction on the hypothesis space but a preference to some hypotheses.

2.3 Overfitting, Underfitting, and Pruning
Overfitting is the concept where a model contains more parameters than the data can reasonably justify; in simpler terms, the model learns too much from noise and interprets noise as meaningful data. Overfit statistical models can therefore suggest things that aren't true, because they have learned too much from noise.
Overfitting generally happens when you have more adjustable parameters than would be optimal, or more simply when the model is more complicated than necessary. The model may then "learn" a specific noisy example and assume it is an important characteristic, when in fact it was merely an outlier. Overfitting can be avoided by being as general as possible, and furthermore by finding some form of average between an overfit and an underfit model.
In science the principle of Occam's Razor is the idea that the simplest solution is often the best or "most correct": do not make things more complicated than necessary. This view is also often used in machine learning, and it holds when working with decision trees as well. Big (complex) decision trees harbour the threat of overfitting: the bigger the tree, the bigger the risk of overfitting.

2.3.1 Causes of Overfitting


(from Wikipedia) Overfitting is especially likely in cases where learning was performed too long or where training
examples are rare, causing the learner to adjust to very specific random features of the training data, that have
no causal relation to the target function. In this process of overfitting, the performance on the training
examples still increases while the performance on unseen data becomes worse.
Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known
data (hindsight) but less accurate in predicting new data (foresight). One can intuitively understand overfitting from
the fact that information from all past experience can be divided into two groups: information that is relevant for the
future and irrelevant information (”noise”). Everything else being equal, the more difficult a criterion is to predict
(i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is
determining which part to ignore. A learning algorithm that can reduce the chance of fitting noise is called robust.
• Noisy training data.
• Small number of instances are associated with leaf nodes. (coincidental regularities may occur that are unrelated
to target concept).

2.3.2 Avoiding Overfitting


• Pre-pruning: Stop the tree from growing before it matches the training data perfectly.
– When to stop? (difficult) Some of the solutions:
∗ Stop splitting a node when it covers fewer than M training instances.
∗ Use a Validation Set: a set of instances used to evaluate the utility of nodes in decision trees.
Usually the training data is randomly split into a growing set and a validation set. The set must be
chosen in a manner that it is unlikely to have the same errors as the growing set. For example see
Reduced-Error Pruning further on in this document.
• Post-pruning: Allow the tree to over-fit, then tweak the tree afterwards. One can also couple overfit models with underfit models and find some form of average between the two.

2.3.3 Underfitting
Underfitting occurs when a statistical model or machine learning algorithm cannot adequately capture the underlying
structure of the data. It occurs when the model or algorithm does not fit the data enough. Underfitting occurs if the
model or algorithm shows low variance but high bias (to contrast the opposite, overfitting from high variance and
low bias). It is often a result of an excessively simple model.

2.3.4 Identifying Overfitness, Underfitness, and Optimality


• Overfitness:
– When performance on training data increases while performance on unseen data/testing data decreases.
The training data is being learned while the unseen data is being misclassified. On a graph it can also be
identified by a wide gap between the training data’s accuracy vs the testing data accuracy.

• Underfitness:
– When performance is poor (error is high, accuracy is low) on both the training AND unseen/testing data.
The model is too generic, and it is not learning enough, leading to poor performance all around. Can be
identified on a graph by seeing low accuracy rates for both sets of data.

• Optimality:
– When performance on both training data and the unseen/testing data follows a very similar pattern,
meaning that something that is affecting the training data is also affecting unseen data, leading to the
conclusion that something else besides model fitness is at play.

2.3.5 Growing Set vs Validation Set


In making a decision tree, we can split the data into two sets: the Growing Set and the Validation Set. When
creating these sets, we (randomly or via some heuristic) remove some examples from the overall data set and put
them into a validation set. We then use the remaining examples as the growing set. The validation set is evaluated
and used as a metric to inform the model when constructing predictions for future models.
When the validation set is of a sufficient size (dependent on the specific model, difficult to generalize here) we
can get sufficient results from the decision tree. However, there are a few things to take into account.

1. As the validation set grows, the growing set shrinks, and vice-versa.
2. If the validation set is too small, it can make extremely general inferences on the data it contains, which it
then uses to inform the decision tree which can lead to an overly-pruned and too-small decision tree.
3. If the validation set is too large, it can lead to an under-pruned and too-large decision tree, leading to inefficiency
when making the decision tree.
4. The size of the validation set is subjective relative to the data, and is often best ”played around with” in order
to generate the most efficient results, measured by other metrics such as relative error rates.

2.3.6 Reduced-Error Pruning


One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with
its most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive,
reduced error pruning has the advantage of simplicity and speed.
• Sub-tree replacement So for pruning a decision node d we do the following:

1. Remove the sub-tree that has node d as root.


2. d is a leaf node now.
3. assign d the most common classification of the training instances associated with d. I.e. see if it is more
likely that at this point the class is true or false and use that as the new leaf node.

We do the above until further pruning is harmful: Evaluate impact on validation set for each node that can be
pruned and remove the sub-tree that most improves validation set accuracy.
• Sub-tree raising
1. Remove the sub-tree that has the parent of node d as root.
2. Place d at the place of its parent
3. Sort the training instances associated with the parent of d using the sub-tree with root d.
Then again evaluate if the accuracy of the tree on the validation set has increased.

2.3.7 Rule Post-Pruning
1. Convert tree to equivalent set of rules.
2. Prune each rule independently of others.

3. Sort final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent
instances.

So for converting into rules we do the following: Start at the root node; for every path to a leaf node we create a
rule using AND operators. Then for every rule try to prune it independently (see if you can achieve higher accuracy
by removing conditions in the rule).

2.3.8 Impurity
Impurity: the diversity of training instances. A high impurity means that there is an equal number of instances of every class; a low impurity means that every instance is of the same class. More formally we can describe impurity as follows: let S be a sample of training instances and pj the proportion of instances of class j (j = 1, ..., J) in S. An impurity measure I(S) must satisfy the following:
• I(S) is minimal only when pi = 1 and pj = 0 for j ≠ i (all objects are of the same class)
• I(S) is maximal only when pj = 1/J for all j (there is exactly the same number of objects of all classes)
• I(S) is symmetric with respect to p1, ..., pJ

2.3.9 Reduction of impurity


Basically the best split is the split that is expected to decrease the impurity the most. This expected decrease in impurity can be calculated as follows:
∆I(S, A) = I(S) − Σa (|Sa| / |S|) · I(Sa)
where Sa is the subset of objects from S s.t. A = a. ∆I is called a score measure or a splitting criterion.

2.3.10 Gini Index


Another way of measuring impurity is the Gini index. It measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset, i.e.
I(S) = Σj pj (1 − pj)
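A minimal sketch of the Gini index computed from a list of class labels:

from collections import Counter

def gini(labels):
    # Gini index: sum over classes of p_j * (1 - p_j), from the class proportions in the sample.
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))   # 0.5: maximal impurity for two balanced classes
print(gini(["yes", "yes", "yes"]))        # 0.0: a pure sample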

2.4 Dealing with continuous attributes


2 solutions:
1. Pre-discretize, e.g. Cold if temp < 10 degrees Celsius.
2. Discretize during tree growing
Now the problem is to find out where to make the ”cut point” during discretization. We cut at the point with
the highest information gain (highest impurity decrease (∆I))
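A minimal sketch of choosing such a cut point: candidate thresholds are placed between consecutive sorted values and the one with the highest entropy-based impurity decrease is kept (the tiny temperature data is made up):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    # Return (threshold, gain) maximizing the impurity decrease of the split "value < threshold".
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                             # candidate threshold between consecutive values
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = base - (len(left) / len(pairs)) * entropy(left) - (len(right) / len(pairs)) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

temps = [5, 8, 12, 20, 25]
play = ["no", "no", "yes", "yes", "yes"]
print(best_cut_point(temps, play))                    # cutting at 10.0 separates the classes perfectly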

2.5 Oblique Decision Trees


Rather than testing just 1 attribute some test conditions may involve multiple attributes. This allows more expressive
representation. However finding the optimal test condition is computationally expensive.

2.6 Attributes with Many Values
If attributes have a lot of values this poses 2 problems:
1. No good splits: they fragment the data too quickly, leaving insufficient data at the next level.

2. A misleadingly high reduction of impurity: the gain criterion tends to prefer attributes with many values.


However we also have 2 solutions:
1. Add a penalty to attributes with many values when applying the splitting criterion.
2. Consider only binary splits.

2.6.1 Gain Ratio


One way of applying such a penalty is the gain ratio: GainRatio(S, A) = InfoGain(S, A) / SplitInformation(S, A). But this method is not flawless; the gain ratio favours unbalanced tests.

2.7 Missing Attribute Values


Another problem that we will come across is missing attribute values. There are a few strategies to deal with this:
• Assign the most common value of A among other instances belonging to the same concept.

• If node n tests the attribute A, assign most common value of A among other instances sorted to node n.
• If node n tests the attribute A, assign a probability to each of the possible values of A. These probabilities are estimated based on the observed frequencies of the values of A among the instances sorted to node n, and they are used in the information gain measure via the term Σv∈Values(A) (|Sv| / |S|) · E(Sv).

2.8 Windowing
Lastly if we don’t have enough memory to fit all the training data in we can use a technique named windowing:
1. Select randomly n instances from the training data D and put them in window set W.
2. Train a decision tree DT on W.

3. Determine a set M of instances from D misclassified by DT.


4. W = W ∪ M
5. IF Not(StopCondition) THEN Go to 2;
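A minimal sketch of this windowing loop, written against a placeholder train(rows, labels) callable that returns a model with a predict method; the stop condition used here (no new misclassified instances) is an assumption:

import random

def windowing(data, labels, train, n=100, max_rounds=10):
    # data/labels: the full training set; train(rows, ys) returns a model with a .predict(rows) method.
    window = set(random.sample(range(len(data)), min(n, len(data))))   # 1. random initial window W
    model = None
    for _ in range(max_rounds):
        rows = [data[i] for i in window]
        ys = [labels[i] for i in window]
        model = train(rows, ys)                                        # 2. train a model on W
        predictions = model.predict(data)
        missed = {i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y}   # 3. set M of misclassified instances
        if missed <= window:                                           # assumed stop condition: no new mistakes outside W
            break
        window |= missed                                               # 4. W = W ∪ M
    return model

# Usage with a trivial majority-class model standing in for the decision tree learner.
class Majority:
    def fit(self, rows, ys):
        self.label = max(set(ys), key=ys.count)
        return self
    def predict(self, rows):
        return [self.label] * len(rows)

model = windowing(list(range(20)), ["a"] * 15 + ["b"] * 5, lambda r, y: Majority().fit(r, y), n=5)
print(model.predict([0]))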

3 Lecture 3: Evaluation of Learning Models
Overview
• Motivation
• Metrics for Classifier’s Evaluation
• Methods for Classifier’s Evaluation
• Comparing Data Mining Schemes
• Costs in Data Mining
– Cost-Sensitive Classification and Learning
– Lift Charts
– ROC Curves

3.1 Motivation
Why evaluate a classifier's generalization performance (i.e. how good the classifier is in practice)?
• To determine whether to employ the classifier: when using a limited data set for training, we need to know how accurate the classifier is in order to decide whether we can deploy it.
• For optimization purposes, e.g. when post-pruning, the accuracy must be determined at every pruning step.

3.2 Evaluation of Classifier Performance


3.2.1 Confusion Matrix
Basically a matrix that visualises the correctly and incorrectly classified instances. It makes a distinction between true positives and true negatives (both correct) and false positives and false negatives (both incorrect):

                     Predicted Positive    Predicted Negative
Actual Positive      True Positive         False Negative
Actual Negative      False Positive        True Negative

3.2.2 Metrics
There are various metrics to evaluate a classifier:
• Accuracy = (TP + TN) / (P + N) = ratio of correctly classified instances
• Error = (FP + FN) / (P + N) = ratio of incorrectly classified instances
• Precision = TP / (TP + FP) = ratio of positively classified instances that are truly positive
• Recall / TP rate (TPR) = TP / P = ratio of positive instances that are correctly classified
• FP rate (FPR) = FP / N = ratio of negative instances that are incorrectly classified
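A minimal sketch computing these metrics from actual and predicted binary labels:

def classification_metrics(actual, predicted):
    # actual/predicted: lists of booleans (True = positive class).
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    pos, neg, total = tp + fn, tn + fp, len(actual)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall_tpr": tp / pos if pos else 0.0,
        "fpr": fp / neg if neg else 0.0,
    }

print(classification_metrics([True, True, False, False], [True, False, True, False]))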
So to which data can we apply these metrics? Before we start we need to define stratification: when stratifying data, make sure that each class is represented with approximately equal proportions in each subset. This is a more advanced version of balancing the data.
• Training Data (Not a good indicator because training data are not a good performance indicator for future
data)
• Independent test data (Requires plenty of data and a natural way to forming training and test data)
• Hold-out method (Data is split in training and test data usually 2/3 and 1/3 respectively. However if the data
is unbalanced samples may not be representative, e.g. few or no instances of a certain class)

• Repeated hold-out method (More reliable than regular hold-out method due to the fact that it repeats the
process with randomly selected different sub-samples possibly with stratification. But this method does not
avoid overlapping test data nor does it guarantee that all instances are used at least once)
• k-fold cross-validation method (Split data into k equally sized stratified subsets then each subset is used for
testing and the remainder for training. The metric estimates are averaged to yield an overall estimate. Standard
method = 10-fold stratified cross-validation. 10-fold gives best results, stratification reduces estimate’s variance.
Further improvement: Repeated 10-fold stratified cross-validation reduces the estimate’s variance even further)
• Leave-one-out cross-validation (number of folds = number of training instances. Makes best use of the data
BUT computationally expensive. Involves no random sub-sampling. Does not allow stratification. Worst case
scenario: data set split equally into 2 classes: 50% accurate on fresh data but estimated error is 100%)

• Bootstrap method, aka 0.632 bootstrap (like cross-validation, but with replacement. Idea: take n samples (of size 1) from a dataset with replacement to create a training set. Instances from the original dataset that don't occur in the new training set are used for testing. The probability of an instance ending up in the test data is e^−1 ≈ 0.368, i.e. test data ≈ 36.8% of instances ⇔ training data ≈ 63.2%. It requires a special error estimate: error = 0.632 · e_test + 0.368 · e_train, where e_x is the error measured on subset x. Repeat the process several times with different replacement samples and average the results.)
• And many more
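A minimal sketch of the k-fold cross-validation procedure from the list above (without stratification), written against a placeholder train_and_score(train_set, test_set) callable:

def k_fold_cross_validation(data, k, train_and_score):
    # Split data into k folds; each fold serves once as the test set; average the k scores.
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train_set, test_set))
    return sum(scores) / k

# Usage with a dummy scorer that just reports the test-fold fraction of the data.
print(k_fold_cross_validation(list(range(10)), 5, lambda tr, te: len(te) / (len(tr) + len(te))))  # 0.2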

3.2.3 Confidence Intervals for Estimates on Classification Performance


If the test data contains more than 30 examples drawn independently of each other, then with approximately N% probability, errorD(h) lies in the interval

errorS(h) ± ZN · sqrt( errorS(h) · (1 − errorS(h)) / n )

where errorS(h) is the estimated error, errorD(h) is the actual error, and ZN is taken from:

N%: 50%   68%   80%   90%   95%   98%   99%
ZN: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
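A minimal sketch of the interval computation using the ZN table above:

from math import sqrt

Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(sample_error, n, confidence=95):
    # Approximate N% interval for the true error; valid for n > 30 independently drawn test examples.
    half_width = Z_N[confidence] * sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width

print(error_confidence_interval(0.10, 100, confidence=95))  # roughly (0.041, 0.159)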

3.2.4 Metric Evaluation TL;DR


Favourable evaluation method by data size: large data sets → hold-out test sets; medium → cross-validation; small → leave-one-out or bootstrap.
Also, don't use test data for parameter tuning; use separate validation data instead.

3.3 Comparing Data-Mining Classifiers


Intuition says: train and test using cross-validation or the bootstrap and rank classifiers according to performance. However, we don't make things easy, do we?

3.3.1 Counting the Costs


Different classification errors come at different costs, e.g. terrorist profiling, loan decisions, etc. In some cases you prefer false positives, in other cases you prefer false negatives. From this one can create a so-called cost matrix (the costs of TP and TN are usually set to 0):

                       Actual Positive    Actual Negative
Hypothesis Positive    TP cost            FP cost
Hypothesis Negative    FN cost            TN cost

Now we can talk about cost-sensitive classification.

3.3.2 Cost-Sensitive Classification


If a classifier outputs probabilities for each class, we can adjust it to minimize the expected costs of the predictions.
Meaning that if we falsely classify we do it at the least possible cost.
The Expected cost is computed as the dot product of the vector of class probabilities and the appropriate column
in the cost matrix.
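A minimal sketch of choosing the prediction with the lowest expected cost, computed as the dot product described above (the example probabilities and costs are made up):

def min_expected_cost_prediction(class_probabilities, cost_matrix):
    # class_probabilities: dict class -> probability; cost_matrix[predicted][actual] -> cost of that outcome.
    expected = {
        predicted: sum(class_probabilities[actual] * costs[actual] for actual in class_probabilities)
        for predicted, costs in cost_matrix.items()
    }
    return min(expected, key=expected.get), expected

# Usage: predicting negative while the case is actually positive is made ten times as costly.
probabilities = {"pos": 0.2, "neg": 0.8}
costs = {"pos": {"pos": 0, "neg": 1}, "neg": {"pos": 10, "neg": 0}}
print(min_expected_cost_prediction(probabilities, costs))  # predicting 'pos' has the lower expected cost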
There are some simple methods for cost sensitive learning:

• Re-sampling of instances according to costs
• Weighting of instances according to costs.

3.4 Lift Charts


In practice decisions are made by comparing possible scenarios and taking into account different costs. In order to deal with this we generate lift charts.

3.4.1 Generating a Lift Chart


What we do is sort the instances by their predicted probability of being positive. Then we can draw a graph with the sample size on the x-axis and the number of true positives on the y-axis.

3.5 ROC Curves


An ROC curve describes the rates of True Positive Rate (TPR) (y-axis) versus the False Positive Rate (FPR) (x-axis).
With this information you can also extract the rates of False negative (1-y) and true negative (1-x). A convex curve
means that there is a good separation between classes. Concavities indicate that there is poor separation between
the classes.
ROC curves and lift charts can be used for internal optimization of classifiers.
A classifier A dominates a classifier B ⇔ TPR_A > TPR_B ∧ FPR_A < FPR_B.
If a classifier lies on the diagonal of the ROC space (e.g. TPR = FPR = 0.5), this means that TPR = FPR. In this case, assuming P = N, the accuracy is:
• (TPR · P + TNR · N) / (P + N)
• = (TPR · P + (1 − FPR) · N) / (P + N)   (because TNR = 1 − FPR)
• = (TPR · P + (1 − TPR) · N) / (P + N)   (because FPR = TPR in this case)
• = (TPR · (P − N) + N) / (P + N)
• = N / (P + N) = 1/2   (because TPR · (P − N) = 0 when P = N)

3.5.1 ROC Convex Hull


Also denoted as ROCCH and is determined by the dominant classifiers. Classifiers that are on the ROCCH achieve
the best accuracy and classifiers below the ROCCH are always sub-optimal. Any performance on a line segment
connecting two ROC points can be achieved by randomly choosing between them. The classifiers on ROCCH can be
combined to form a hybrid.

3.5.2 Iso-Accuracy Lines


Iso-accuracy lines are lines that denote the same accuracy over the ROC space: if such a line connects two ROC points, those points have the same accuracy. Iso-accuracy lines have slope N/P. Higher iso-accuracy lines are better (higher as in higher accuracy / true positive rate).

3.5.3 Constructing ROC Curve for 1 Classifier


1. Sort instances on probability of being positive
2. move a threshold on the sorted instances.
3. For each threshold define a classifier with confusion matrix.
4. Plot the True positive rate and the False positive rate of the classifiers.
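A minimal sketch of this threshold-sweeping construction; each threshold yields one (FPR, TPR) point (the example scores are made up):

def roc_points(scores, labels):
    # scores: predicted probability of being positive; labels: booleans. Returns (FPR, TPR) points.
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sweep the threshold downward through the instances sorted by score (steps 1 and 2).
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))       # steps 3 and 4: one ROC point per threshold
    return points

print(roc_points([0.9, 0.8, 0.4, 0.3], [True, False, True, False]))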

3.5.4 Area Under Curve Metric (AUC)


The area under the curve assesses the separation of the classes. A high area under the ROC curve means that there is a good separation. The area under the curve estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.

4 Lecture 4: Bayesian Learning
4.1 Introduction
• Each observed training instance can incrementally decrease or increase the estimated probability that a hy-
pothesis is correct.
• Prior knowledge is combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods accommodate hypotheses that make probabilistic predictions (e.g. 93% chance of recovery)
• Instances are classified by combining predictions of multiple hypotheses, weighted by their probabilities.
• Requires initial knowledge of many probabilities.
• High computational cost.
• Is a standard for optimal learning.

4.2 Bayes Theorem


Goal: Determine the final probability of hypothesis h given the data D from:
• Prior probability of h, P(h): background knowledge about chance that h is correct regardless of observed
data.
• Prior probability of D, P(D): probability that training data D will be observed without knowledge about
which hypothesis h holds.
• Conditional Probability of observation D, P (D | h): probability of observing data D given some world
in which hypothesis h holds.
Now our goal is the posterior probability of h, P(h | D), i.e. the probability that h holds given training data D.
Bayes' theorem allows us to compute P(h | D):
P(h | D) = P(D | h) · P(h) / P(D)

4.3 Maximum a Posteriori Hypothesis (MAP)


The Maximum a Posteriori Hypothesis is the most probable hypothesis. i.e. the hypothesis h in the hypothesis space
that has the highest P (h | D).

4.4 Useful Formulas


• Product Rule: P (A ∧ B) = P (A | B)P (B) = P (B | A)P (A)
• Disjunction Rule: P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
• Theorem of Total Probability: P(B) = Σ(i=1..n) P(B | Ai) · P(Ai)

4.5 Brute Force MAP hypothesis learner


Boils down to: Calculate posterior probability (P (h | D)) for every hypothesis. Then pick the hypothesis with the
highest probability.
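A minimal sketch of the brute-force MAP learner, assuming a noise-free setting in which P(D | h) is 1 if h is consistent with D and 0 otherwise (the tiny hypothesis space is made up):

def brute_force_map(hypotheses, priors, data):
    # hypotheses: callables h(x) -> label; priors: list of P(h); data: list of (x, y) pairs.
    def likelihood(h):
        return 1.0 if all(h(x) == y for x, y in data) else 0.0          # P(D | h), assuming noise-free data
    scores = [likelihood(h) * p for h, p in zip(hypotheses, priors)]     # proportional to the posterior P(h | D)
    best = max(range(len(hypotheses)), key=lambda i: scores[i])
    return hypotheses[best], scores[best]

hypotheses = [lambda x: x > 1, lambda x: x > 3]
h_map, score = brute_force_map(hypotheses, [0.5, 0.5], [(2, True), (0, False)])
print(score)  # 0.5: only the first hypothesis is consistent with the data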

4.6 Minimum Description Length Principle


This is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to
the best compression of the data.
Given the data, this principle picks the hypothesis that maximizes the product P(D | h) · P(h):
• hMAP = argmax P(D | h) · P(h)
• = argmax (log2 P(D | h) + log2 P(h))
• = argmin (−log2 P(D | h) − log2 P(h))

4.7 Bayes Optimal Classifier
Another problem is the following: Given data D, hypothesis space H, and a new instance x, what is the most probable
classification of x? It is not the most probable hypothesis in H. The Bayes optimal classifier assigns to an instance
the classification cj that has the maximum posterior probability P (cj | D). Now the maximum posterior probability
P (cj | D) is calculated using the theorem for total probability. It is calculated using all the hypotheses weighted by
their posterior probabilities w.r.t. the data D:
v_OB = argmax_{c_j ∈ {+,−}} P(c_j | D) = argmax_{c_j ∈ {+,−}} Σ_{h_i ∈ H} P(c_j | h_i) P(h_i | D)
It is the best classification method in terms of its average accuracy. However, the Bayes optimal classifier may not be in
the hypothesis space!

4.8 Gibbs Classifier


1. Choose hypothesis at random according to P (h | D)

2. Use this hypothesis to classify new instance

Expected error: E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

4.9 Naı̈ve Bayes Classifier


Given attributes a ∈ A and class values v ∈ V, pick the class value v_j with maximum probability:

• v_MAP = argmax_{v_j} P(v_j) Π_i P(a_i | v_j)

It assumes that attributes are conditionally independent!

To estimate the probability P (A = v | C) of an attribute-value A = v for a given class C we use:


• Relative frequency: n_c / n, where n_c is the number of instances that belong to class C and have value v for attribute A, and n is the number of training instances of class C.
• M-estimate: (n_c + mp) / (n + m), where p is a prior estimate of P(A = v | C) and m is the weight given to p.
We take the normalized probabilities of the outcomes of the above, and the instance is classified as the class value with the higher probability.
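A rough sketch of training and prediction with the m-estimate (representing examples as (attribute-dict, class) pairs and using a uniform prior p = 1/k are illustrative assumptions, not the lecture's exact formulation):

from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    # examples: list of (attributes, class_label) pairs, attributes being a dict
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)            # (class, attribute) -> Counter over values
    for attrs, c in examples:
        for a, v in attrs.items():
            value_counts[(c, a)][v] += 1
    return class_counts, value_counts, len(examples)

def predict(attrs, class_counts, value_counts, n_total, m=1.0):
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / n_total)                    # log P(C)
        for a, v in attrs.items():
            n_cv = value_counts[(c, a)][v]                 # instances of C with A = v
            k = len(value_counts[(c, a)]) or 1             # observed number of values of A (assumption)
            p = 1.0 / k                                    # uniform prior p for the m-estimate
            score += math.log((n_cv + m * p) / (n_c + m))  # log of the m-estimate of P(A = v | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class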

5 Lecture 5: Linear Regression
TL;DR
Linear regression is the act of trying to fit a function that predicts an output y from an input vector X, using the training values x ∈ X to capture the patterns in the data. We usually do this by minimizing the least-squares error between the data points and the
approximating function, or by minimizing a penalized version of the least-squares loss function.

5.1 Supervised Learning: Regression


Linear regression models the relationship between a scalar dependent variable y and one or more explanatory variables
(or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For
more than one explanatory variable, the process is called multiple linear regression.
In linear regression, the relationships are modeled using linear predictor functions whose unknown model param-
eters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of y
given the value of X is assumed to be an affine function of X ; less commonly, the median or some other quantile of
the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis,
linear regression focuses on the conditional probability distribution of y given X .

5.1.1 Regression versus Classification


When do we consider a problem a classification or a regression problem? A classification problem identifies discrete cases (true or false, 0 or 1), whereas a regression problem predicts (continuous) amounts/values.

5.2 Linear Regression


Given a training set of values (vector) X, apply a learning algorithm and try to learn a hypothesis h, represented as
a linear function where h is a function that maps x values to y results:
• y = hΘ(x) = Θ0 x0 + Θ1 x1 + ... + Θn xn for every input variable x_i, where 0 ≤ i ≤ n (and x0 = 1 by convention).
But how do we calculate the parameters Θ?

5.3 Cost function intuition


We want to choose Θ0 , Θ1 such that hΘ (x) is close to y for our training examples (x, y). The idea behind the cost
function is that we want to minimize the total distance between the (estimation) line and the training data. When
minimizing the cost, we often normalize by m so that we can view the cost function as an approximation of the
”generalization error,” or the expected square loss on a randomly chosen new example. Put more simply, we are
minimizing the error rate instead of the total error. For models with 1 variable:

• Hypothesis:
– hΘ (x) = Θ0 + Θ1 x
• Parameters:
– Θ0 , Θ1
• Cost Function J(Θ0, Θ1):
– J(Θ0, Θ1) = (1/2m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2, where
– hΘ(x^(i)) − y^(i)
is the difference between the calculated result and the actual test data. To find optimal values for the parameters Θ0 and Θ1 we want to minimize this difference between the calculated results and the actual results of
our test data.
We attach the coefficient 1/2 so that the exponent 2 of the square does not clutter the resulting derivative (the two cancel). We also
divide by the number of summands m to get the average cost per data point.

The error measure in the cost function is a ”statistical distance”; in contrast to the popular and preliminary
understanding of distance between two vectors in Euclidean space. With statistical distance we are attempting to
map the ”dis-similarity” between estimated model and optimal model to Euclidean space.
There is no constricting rule regarding the formulation of this statistical distance, but if the choice is appropriate
then a progressive reduction in this ’distance’ during optimization translates to a progressively improving model
estimation. Consequently, the choice of ’statistical distance’ or error measure is related to the underlying data
distribution.

5.3.1 Least Squares Error
Given a collection of data points (x_i, y_i) and a hypothesis h for some Θ, the least-squares error of h on a single data point (x_i, y_i) is:
• (hΘ(x_i) − y_i)^2
If we sum up the errors over all data points, we multiply by 1/2 to prevent the exponent 2 of the square from having an effect on the derivative, resulting in the total error:
• (1/2) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2
We also divide the total error by the number of summands m to get the average error per data point, giving
us the resulting coefficient of 1/(2m):
• (1/2m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2

When comparing performance on two data sets of different size, the raw sum of squared errors are not directly
comparable because larger data sets tend to lead to higher error totals. When you normalize, you can compare
the average error per data point.

5.4 Gradient descent


Gradient descent is a very well-known algorithm for finding maxima and minima; however, it can get stuck in local
minima. It is used in all sorts of optimization problems, not just regression. It is relatively simple compared
to other more sophisticated techniques, yet is still useful.
Gradient Descent is an iterative algorithm for finding max/min of a function:
1. Start with some Θ0 , Θ1
2. Keep updating Θ0 , Θ1 to reduce J(Θ0 , Θ1 ) until you (hopefully) reach a minimum.
Mathematically:
1. hΘ(x) = Θ0 + Θ1 x
2. J(Θ0, Θ1) = (1/2m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2
3. Θj := Θj − α ∂/∂Θj J(Θ0, Θ1) for j = 0, 1 (repeat until convergence, updating Θ0 and Θ1 simultaneously)

where m is the no. of data points and α is the learning rate. (Usually pre-defined)
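A minimal sketch of this procedure for simple (one-variable) linear regression, assuming hypothetical lists x and y of training values:

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        h = [theta0 + theta1 * xi for xi in x]                           # current predictions
        grad0 = sum(h[i] - y[i] for i in range(m)) / m                   # dJ/dTheta0
        grad1 = sum((h[i] - y[i]) * x[i] for i in range(m)) / m          # dJ/dTheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1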

5.4.1 Choosing Learning Rate


We don’t want the learning rate α to be too small or too big:
• Too small: Slow convergence
• Too big: the gradient step may overshoot the minimum (and thus we may fail to converge, or even diverge)

5.4.2 Multiple Features


Gradient descent can also be used for multivariate linear regression, where the hypothesis is:
• hΘ(x) = Θ0 + Θ1 x1 + Θ2 x2 + ... + Θn xn
and the cost function J(Θ) is defined as before. The gradient descent algorithm then looks like this:
Repeat until converged:
1. Θj := Θj − α (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i)

NOTE: Simultaneously update every Θj ! Only after updating ALL Θ’s should you update hΘ (x)!

5.5 Normal Equation
5.5.1 Feature Scaling
With feature scaling we get all features into the [−1, 1] range. Basically, we standardize the range of the
independent variables (features) of the data, because scaling ensures that features with large values do not dominate
as the main predictor. This can noticeably improve the performance of the gradient descent algorithm. An alternative that
needs no feature scaling or iteration is the normal equation, described next.

5.5.2 The Algorithm


The normal equation is derived as follows:
• we want to minimize over Θ: (1/2)[XΘ − y]^T [XΘ − y], whose gradient with respect to Θ is proportional to:
• X^T X Θ − X^T y; setting this gradient to zero gives:
• X^T X Θ = X^T y, from which it follows that:
• Θ = (X^T X)^{-1} X^T y (note: the −1 denotes matrix inversion here).
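A small NumPy sketch of this closed-form solution (assuming X already contains a leading column of ones for the intercept):

import numpy as np

def normal_equation(X, y):
    # Theta = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)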

5.6 Normal Equation vs Gradient Descent


• Gradient Descent
– Need to choose α
– needs many iterations
– works well even when the number of features is large
• Normal Equation
– No need for α
– No need to iterate
– Needs to compute (X T X)−1
∗ O(n3 )
∗ might be non-invertible

5.7 Finding the ”right” model


There are two problems that we may face: overfitting and underfitting. These can be addressed by one of the following:
1. Reducing the number of features.
• Manually select which features to keep
• Model selection algorithm
2. Regularization
• Keeps all the features but reduces the magnitude of parameters Θj .
• Works well when we have a lot of features, each of which contributes a bit to predicting y.

5.7.1 Regularization
When applying regularization we alter the cost function into the following:
• J(Θ) = (1/2m) [ Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i))^2 + λ Σ_{j=1}^{n} Θ_j^2 ], where the regularization term we add is λ Σ_{j=1}^{n} Θ_j^2.

Regularization parameter λ is an input parameter to the model. Lambda can be selected by sub-sampling the
data and finding the variation. The value of lambda can reduce overfitting as it increases, however it does
this at the expense of greater bias.

For the gradient descent algorithm it would look as follows:
• Θj := Θj − α [ (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i) + (λ/m) Θj ], which can be rewritten as
• Θj := Θj (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i)

For the normal equation:
• Θ = (X^T X + λ L)^{-1} X^T y, where L is the (n+1) × (n+1) diagonal matrix diag(0, 1, 1, ..., 1), i.e. an identity matrix whose top-left entry is 0 so that the intercept term is not regularized.
Two advantages:

1. Fights over-fitting
2. Guarantees matrix of full rank, and thus invertible
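A small NumPy sketch of this regularized variant under the same assumption as before (L is the identity matrix with a zero in its top-left entry, so the intercept is not penalized):

import numpy as np

def regularized_normal_equation(X, y, lam):
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                                       # do not regularize the bias term
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)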

6 Lecture 6: Logistic Regression and Artificial Neural Networks
6.1 Logistic Regression
We can cast a binary classification problem into a continuous regression problem. However we can not simply use the
linear regression that we mentioned before. Logistic regression is used when the variable y that we want to predict
can only take on discrete values (i.e. Classification). Considering a binary classification problem (y = 0 or y = 1),
the hypothesis function could be defined so that it is bounded between [0, 1] in which we use some form of logistic
function, such as the sigmoid function. Other, more efficient functions exist, such as the ReLU (Rectified Linear
Unit); however, these are not covered in this course, as the sigmoid function is the historical standard.

6.1.1 Sigmoid Logistic Regression


One option is to use a sigmoid function. Why? Because it allows hΘ(x) to only take values between 0 and 1. This
means a smoother transition is made from false to true.
Sigmoid function:
• g(z) = 1 / (1 + e^{−z})

Now for the hypothesis:


• hΘ(x) = g(Θ^T x) = 1 / (1 + e^{−Θ^T x})

The decision boundary for the logistic sigmoid function is where hΘ (x) = 0.5 (values less than 0.5 means false,
values equal to or more than 0.5 means true). Another interesting property is that it also gives a chance of the instance
being of that class, e.g. hΘ(x) = 0.7 means that there is a 70% chance that the instance is of the corresponding class.
For example, with Θ = (−3, 1, 1) we get:
• hΘ(x) = g(Θ0 + Θ1 x1 + Θ2 x2) = g(−3 + x1 + x2), and we predict y = 1 if:

• −3 + x1 + x2 ≥ 0

6.1.2 Non-Linear Decision Boundaries


In the above cases of logistic regression we are speaking of a linear decision boundary (meaning we can draw
a straight line separating one class from the other instances). However, sometimes this is not the case. When dealing with
non-linear decision boundaries we use higher-order polynomials in order to be able to classify these cases, e.g. (with Θ = (−1, 0, 0, 1, 1)):

• hΘ (x) = g(Θ0 + Θ1 x1 + Θ2 x2 + Θ3 x21 + Θ4 x22 ) and we predict y=1 if:


• −1 + x21 + x22 ≥ 0

6.2 Cost Function


Given a new hypothesis, we now need a cost function. For linear regression we used the squared-error cost:
• J(Θ0, Θ1) = (1/2m) Σ_{i=1}^{m} Cost(hΘ(x^(i)), y^(i)), where
• Cost(hΘ(x), y) = (1/2)(hΘ(x) − y)^2
However, by plugging the sigmoid function into this cost, we end up with a non-convex cost function. This means that gradient descent can get stuck in local minima instead of finding the global minimum, which leads to slow or incorrect learning. Instead we use the logarithmic cost:
Cost(hΘ(x), y) = −log(hΘ(x)) if y = 1, and −log(1 − hΘ(x)) if y = 0
This means that the optimization objective function can be defined as the mean of the costs/errors in the
training set:
• J(Θ) = (1/m) Σ_{i=1}^{m} Cost(hΘ(x^(i)), y^(i))

6.3 Gradient Descent for Logistic Regression
How do we find the right Θ parameter value? We use gradient descent!
• Repeat until convergence:
1. Θj := Θj − α (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_j^(i)

NOTE: Simultaneously update all Θj !


Looks identical to linear regression, but with hΘ(x) = 1 / (1 + e^{−Θ^T x}). With regularization:
Repeat until convergence:
1. Θj := Θj − α ∂/∂Θj J(Θ), where:
(a) ∂/∂Θ0 J(Θ) = (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_0^(i)
(b) ∂/∂Θ1 J(Θ) = (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_1^(i) + (λ/m) Θ1
(c) ∂/∂Θ2 J(Θ) = (1/m) Σ_{i=1}^{m} (hΘ(x^(i)) − y^(i)) x_2^(i) + (λ/m) Θ2, and so on.
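A minimal NumPy sketch of (regularized) gradient descent for logistic regression, assuming X has a leading column of ones and y holds 0/1 labels (names and default values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, lam=0.0, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)               # predictions in (0, 1)
        grad = (X.T @ (h - y)) / m           # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]    # regularize every parameter except theta_0
        theta -= alpha * grad                # simultaneous update of all theta_j
    return theta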

6.4 Multi-Class Problems


Simply train k "one vs. all" classifiers, one per class. When predicting, pick the class with the highest probability (highest outcome
of hΘ).

6.5 Artificial Neural Networks


Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute
animal brains. Such systems learn (progressively improve performance on) tasks by considering examples, generally
without task-specific programming. An ANN is based on a collection of connected units or nodes called artificial
neurons (analogous to biological neurons in an animal brain). Each connection (synapse) between neurons can
transmit a signal from one to another. The receiving (postsynaptic) neuron can process the signal(s) and then signal
neurons connected to it.
In common ANN implementations, the synapse signal is a real number, and the output of each neuron is calculated
by a non-linear function of the sum of its inputs. Neurons and synapses typically have a weight that adjusts as learning
proceeds. The weight increases or decreases the strength of the signal that it sends across the synapse. Neurons may
have a threshold such that only if the aggregate signal crosses that threshold is the signal sent.
Typically, neurons are organized in layers. Different layers may perform different kinds of transformations on
their inputs. Signals travel from the first (input), to the last (output) layer, possibly after traversing the layers
multiple times. Alternative architectures include:
• Recurrent networks (give a memory effect, e.g. for counting, adding, etc.)
• Networks with multiple output units for multi-class problems

6.5.1 Forward Propagation


With Neural Networks, we’re trying to find a minimum of some certain function, where each neuron is connected to
all other neurons in the previous layer, where the weights in the weighted sum are acting like the strength of each of
those connections. The bias is some indication whether that specific neuron tends to be active or inactive.
• a_i^(j) = "activation" of unit i in layer j
• Θ^(j) = matrix of weights controlling the function mapping from layer j to layer j+1. It has dimension s_{j+1} × (s_j + 1), where s_j is the number of nodes in layer j.

so:
• a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2)
• a_2^(2) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2)
• hΘ(x) = g(Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2))
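A sketch of forward propagation for one hidden layer, matching the activations above (Theta1 and Theta2 are assumed weight matrices of shapes s_2 × (s_1 + 1) and 1 × (s_2 + 1); the bias units x_0 = a_0 = 1 are added explicitly):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))      # input layer with bias unit x_0 = 1
    a2 = sigmoid(Theta1 @ a1)            # hidden-layer activations a^(2)
    a2 = np.concatenate(([1.0], a2))     # add bias unit a_0^(2) = 1
    return sigmoid(Theta2 @ a2)          # output h_Theta(x)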

6.5.2 Learning The Weights


Back-propagation uses gradient descent, similar to linear and logistic regression. Where do we get errors for internal nodes?
It is given that d/dx g(x) = g(x)(1 − g(x)), and we can back-propagate as follows.
Algorithm for learning the weights:
Training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}
Set Δ_ij^(l) = 0 (for all l, i, j)
For i = 1 to m {
  Set a^(1) = x^(i)
  Perform forward propagation to compute a^(l) for l = 2, 3, ..., L
  Using y^(i), compute δ^(L) = a^(L) − y^(i)
  Compute δ^(L−1), δ^(L−2), ..., δ^(2)
  Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)
}
D_ij^(l) := (1/m) [Δ_ij^(l) + λ Θ_ij^(l)] if j ≠ 0
D_ij^(l) := (1/m) Δ_ij^(l) if j = 0
∂/∂Θ_ij^(l) J(Θ) = D_ij^(l)

6.5.3 Properties Of Neural Networks


• Useful for modelling complex, non-linear function of numerical inputs and outputs

– symbolic inputs/outputs represented using some encoding


– 2 or 3 layer networks can approximate a huge class of functions (if enough neurons in hidden layers)
• Robust to noise; but risk of over fitting (due to high expressiveness)! e.g. training for too long. Usually handled
using validation sets.

• All inputs have some effect: decision trees select the most important attributes, whereas an ANN "selects" attributes
by giving them higher or lower weights
• Explanatory power of ANNs is limited
– Model represented as weights in network
– No simple explanation why network makes a certain prediction (cf. trees can give a rule that was used)
– Networks can not easily be translated into a symbolic model (tree, ruleset)

Use ANNs when:


• High dimensional input and output (numeric or symbolic)
• Interpretability of model unimportant

7 Lecture 7: Recommender Systems
7.1 Collaborative Filtering
In short what this means is that we look at what other users/customers liked/rated and try to use this information
to recommend other products.

7.2 Content Based Approach


Given a list of films:

Movie                   Alice (Θ^(1))   Bob (Θ^(2))   Carol (Θ^(3))   User (4)
Love at last                 5               5              0              0
Romance Forever              5               ?              ?              0
Cute puppies of Love         ?               4              0              ?
Nonstop car chases           0               0              5              4
Swords vs. karate            0               0              5              ?

• Now the optimization criterion, to learn Θ^(j) (the parameter vector for user j):
– min_{Θ^(j)} (1/2) Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{k=1}^{n} (Θ_k^(j))^2
• Now in order to learn all parameters Θ^(1), Θ^(2), ..., Θ^(n_u):
– min_{Θ^(1),...,Θ^(n_u)} (1/2) Σ_{j=1}^{n_u} Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2
Note: the 2nd formula combines the knowledge from all users!
• So now we can update the gradient descent algorithm for this case:
– Θ_k^(j) := Θ_k^(j) − α Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j)) x_k^(i)   (for k = 0)
– Θ_k^(j) := Θ_k^(j) − α ( Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j)) x_k^(i) + λ Θ_k^(j) )   (for k ≠ 0)

7.3 Collaborative Filtering


Given x^(1), ..., x^(n_m), estimate Θ^(1), ..., Θ^(n_u):
• min_{Θ^(1),...,Θ^(n_u)} (1/2) Σ_{j=1}^{n_u} Σ_{i: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (Θ_k^(j))^2
Given Θ^(1), ..., Θ^(n_u), estimate x^(1), ..., x^(n_m):
• min_{x^(1),...,x^(n_m)} (1/2) Σ_{i=1}^{n_m} Σ_{j: r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))^2
Estimating x^(1), ..., x^(n_m) and Θ^(1), ..., Θ^(n_u) simultaneously:
• J(x^(1), ..., x^(n_m), Θ^(1), ..., Θ^(n_u)) = (1/2) Σ_{(i,j): r(i,j)=1} ((Θ^(j))^T x^(i) − y^(i,j))^2 + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x_k^(i))^2 + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (Θ_k^(j))^2

7.3.1 Collaborative Filtering Algorithm
1. Initialize the input features x^(1), ..., x^(n_m) and weights Θ^(1), ..., Θ^(n_u) to small random values.
2. Minimize the cost function J(x^(1), ..., x^(n_m), Θ^(1), ..., Θ^(n_u)) using gradient descent (or another optimization algorithm).

3. For a user with (learned) parameter Θ and a movie with (learned) features x, predict a star rating of ΘT x.
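A sketch of the prediction step and the joint cost function (Y is the ratings matrix, R a 0/1 matrix marking which ratings exist, X and Theta the learned feature and parameter matrices; these names are illustrative assumptions):

import numpy as np

def predict_rating(theta_j, x_i):
    return theta_j @ x_i                   # predicted star rating Theta^T x

def cofi_cost(X, Theta, Y, R, lam):
    err = (X @ Theta.T - Y) * R            # only count entries where a rating exists
    return 0.5 * np.sum(err ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))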

7.3.2 Mean Normalization


Brand new users will receive a prediction of 0 (not very useful). In order to avoid this we can normalize the
mean. What we do is, we calculate the average rating, and we normalize by subtracting the average rating from the
set of ratings that each existing user has given so far. Then for user j, on movie i predict: (Θ(j) )T (x(i) ) + µi So if
there is no information from a user we give recommendations equal to the average rating!

7.4 Support Vector Machines


Usable in similar situations as neural networks. Important concepts:

• Finding a ”maximal margin” separation.


• Transformation into high dimensional space.

7.4.1 Linear SVMs


The idea is to find a hyperplane that discriminates + from - where the margin/distance of hyperplane to closest
points is maximal. The solution is unique and determined by just a few points (Support vectors).

7.4.2 Non-Linear SVMs


1. Transform the data to a higher-dimensional space where they are hopefully linearly separable.
2. Learn linear SVM in that space.
3. Transform linear SVM back to original space.

7.4.3 Logistic Regression to SVM


Alternative view on logistic regression:
• Cost of example: −(y log hΘ(x) + (1 − y) log(1 − hΘ(x))) = −y log(1 / (1 + e^{−Θ^T x})) − (1 − y) log(1 − 1 / (1 + e^{−Θ^T x}))

This can be done for similar reasons why we would use logistic regression in other classification cases.

7.4.4 Kernels
Pick data points in the space (named landmarks). Idea is that by applying a positive or negative weight to the
distance to a data point/kernel we can predict whether or not a new instance is a class:

• predict y = 1 if:
– Θ0 + Θ1 f1 + Θ2 f2 + ... + Θi fi ≥ 0
• given x:
– f_i = similarity(x, l^(i)) = exp(−||x − l^(i)||^2 / (2δ^2)), where l^(i) is kernel (landmark) i
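A one-line sketch of this Gaussian similarity feature (l is the landmark and delta the bandwidth, as in the formula above):

import numpy as np

def gaussian_kernel(x, l, delta):
    # f = exp(-||x - l||^2 / (2 * delta^2)): close to 1 near the landmark, near 0 far away
    return np.exp(-np.sum((x - l) ** 2) / (2 * delta ** 2))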

7.4.5 Cost Function
Hypothesis: Given x, compute features f ∈ R^{m+1}:
• predict "y = 1" if Θ^T f ≥ 0

Training:
• min_Θ C Σ_{i=1}^{m} [ y^(i) cost_1(Θ^T f^(i)) + (1 − y^(i)) cost_0(Θ^T f^(i)) ] + (1/2) Σ_{j=1}^{n} Θ_j^2

7.5 Compare SVM


It is interesting to compare SVMs with:
• Multi-layered Neural Networks:
– Perceptron: linear separation, not with maximal margin.
– ANN obtains better expressiveness by changing representation throughout its layers.
– SVM obtains better expressiveness through non-linear transformation.
• Instance Based Learning:
– SVM stores examples that identify boundary between classes; classification based on which side of the
boundary new example is.
– IBL: stores all examples; classification based on distance to stored examples.

8 Lecture 8:
8.1 Nearest Neighbor Algorithm
Idea: Instances that lie ”close” to each other are most likely similar to each other.

8.1.1 Properties
• Learning is very fast
• No info is lost (brings disadvantage: ”Details” may be noisy)
• Hypothesis space:

– Variable size
– Complexity of the hypothesis rises with the number of stored examples

8.1.2 Decision Boundaries


The sample space is basically ”cut” into pieces for all data points. These boundaries are not computed!
So in essence we keep all information. However this comes with the problem that more details means more noise.
In order to improve robustness against noisy learning examples we use a set of nearest neighbors. For classification
we use voting, and for regression we use the mean.
The method in the book contains a mistake, see the slide about the book.

8.1.3 Lazy vs Eager Learning


Lazy learning: Don’t do anything until we need to make a prediction (e.g. Nearest Neighbor)
• Learning is fast

• Predictions require work and can be slow


Eager learning: Start computing as soon as we receive data. (Decision tree, neural networks etc.)
• Learning can be slow
• predictions are usually fast!

8.1.4 Inductive vs Transductive learning


Induction: for input x find a model/function to calculate y.
• Computations take only learning data into account
• a single model must work well for all new data: global model
Transduction: for input x find some output y

• computations can take extra info about the needed predictions into account.
• Can use local models that work well in the neighborhood of the target example.

8.1.5 Semi-Supervised Learning


The learner gets a set of labeled data and a set of unlabeled data. Information about the probability distribution of
examples can help the learner. Given the little info on the slides, this is probably not important.

8.1.6 Distance Definition
The representation of the data is critical; it makes or breaks the NN algorithm.
For example, the Manhattan, Euclidean, or general L_n-norm distance for numerical attributes:
L_n(x1, x2) = ( Σ_{i=1}^{#dim} |x_{1,i} − x_{2,i}|^n )^{1/n}
Hamming distance for nominal attributes:
d(x, y) = Σ_{i=1}^{n} δ(x_i, y_i)
where δ(x_i, y_i) = 0 if x_i = y_i, and δ(x_i, y_i) = 1 if x_i ≠ y_i
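A small sketch of these two distance measures:

def ln_norm(x1, x2, n=2):
    # n = 1 gives the Manhattan distance, n = 2 the Euclidean distance
    return sum(abs(a - b) ** n for a, b in zip(x1, x2)) ** (1.0 / n)

def hamming(x, y):
    # number of nominal attributes on which the two instances differ
    return sum(1 for a, b in zip(x, y) if a != b)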

8.1.7 Normalization of Attributes


In order to avoid problems we normalize the attribute values. If we then want to capture the 5 nearest neighbors (say 0.1% of the data), we need:
• 1 dim: 0.1% of the range
• 2 dim: (0.1%)^{1/2} ≈ 3% of the range
• n dim: (0.1%)^{1/n} of the range
This is also called the curse of dimensionality.

8.1.8 Weighted Distances


Curse of Noisy Features: Big data sets with e.g. 10 dimensions already require almost 60% of the range. Therefore
irrelevant data destroy the metric’s meaningfulness.
But of course we have a solution for this: weighted distances!
d_w(x, y) = sqrt( Σ_{j=1}^{D} w_j (x_j − y_j)^2 )
Selecting attribute weights: we have several options:
• experimentally find out which weighs work well (cross-validation)
• Other solutions, e.g. Langley, 1996:
1. Normalize attributes (to scale 0-1)
2. Select weights according to ”average attribute similarity within class”

8.1.9 More distances


• Strings: Levenshtein distance/edit distance = minimal number of changes to change one word to the other.
Allowed edits: delete, insert, change.
• Euclidean: D(Q, C) ≡ sqrt( Σ_{i=1}^{n} (q_i − c_i)^2 ) (Pythagoras!)

• Sequence Distances:
– Dynamic Time Warping: Sequences are aligned ”one to one” (non linear alignments are possible)
– Dimensionality reduction

8.2 Distance-weighted kNN
Idea: give higher weight to closer instances; we can then even use all training instances instead of only k (aka "Shepard's method").
• f̂(x_q) = ( Σ_{i=1}^{k} w_i f(x_i) ) / ( Σ_{i=1}^{k} w_i ), with w_i = 1 / d(x_q, x_i)^2

This results in a fast learning algorithm but it has slow predictions. Efficiency:
• for each prediction, kNN needs to compute the distance for ALL stored examples.
• Prediction time = linear in the size of the data set, for large training sets and/or complex distances this can
be too slow to be practical.
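A sketch of distance-weighted kNN regression in the spirit of Shepard's method (data is assumed to be a list of (x, f(x)) pairs and dist a distance function such as the Euclidean one above):

def weighted_knn_predict(xq, data, dist, k=5):
    # take the k nearest stored examples and weight each by 1 / d^2
    neighbors = sorted(data, key=lambda ex: dist(xq, ex[0]))[:k]
    num = den = 0.0
    for x, fx in neighbors:
        d = dist(xq, x)
        if d == 0.0:
            return fx                      # exact match: return its stored value directly
        w = 1.0 / d ** 2
        num += w * fx
        den += w
    return num / den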

8.2.1 Edited k-nearest neighbor


• Less storage (good).
• Order dependent (bad).
• Sensitive to noisy data (bad).

• More advanced alternatives exist (= IB3).

The algorithm:
Incremental deletion of examples: Edited k-NN(S), where S is a set of instances
  For each instance x in S:
    if x is correctly classified by S \ x:
      remove x from S
  Return S

Incremental addition of examples: Edited k-NN(S), where S is a set of instances
  T = ∅
  For each instance x in S:
    if x is not correctly classified by T:
      add x to T
  Return T

8.3 Pipeline Filters


Pipeline filters: Reduce time spent on far-away examples by using more efficient distance-estimates first. We can
eliminate most examples using rough distance approximations and compute more precise distances for examples in
the neighborhood.

8.4 kD-trees
kD-trees: use a clever data structure to eliminate the need to compute all distances. kD-trees are similar to decision
trees except:
• Splits are made on the median/mean value of dimension with highest variance
• Each node stores one data point, leaves can be empty

Finds closest neighbor in logarithmic (depth of tree) time. However building a good kD-tree may take some time:
Learning time is no longer 0 and incremental learning is no longer trivial:
• kD-tree will no longer be balanced
• re-building the tree is recommended when the max depth becomes larger than 2 * the minimal required depth
(= log(N) with N training examples).

Using Prototypes: the rough decision surfaces of nearest neighbor can sometimes be considered a disadvantage. We
can solve two problems at once by using prototypes (= representative for a whole group of instances) For example
prototypes can be:
• single instances, replacing a group
• other structure, (e.g., rectangle/shape, rule, ..)
• Radial basis function networks: basically build a global approximation as a linear combination of local approximations: f(x) = w_0 + Σ_{u=1}^{k} w_u K_u(d(x_u, x)).
A common choice is K_u(d(x_u, x)) = exp(−d^2(x_u, x) / (2δ_u^2)). By using this, the influence of each local approximation u
goes down quickly with distance.

8.5 Local Learning


• Collect k nearest neighbors
• Give them a supervised algorithm
• Apply learned model to test example

Locally weighted regression: build a local model in the region around x_q (e.g. a linear or quadratic model), minimizing:
• Squared error for the k neighbors: E_1(x_q) ≡ Σ_{x ∈ kNN(x_q)} (f(x) − f̂(x))^2
• Distance-weighted squared error for all neighbors: E_2(x_q) ≡ Σ_{x ∈ D} (f(x) − f̂(x))^2 K(d(x_q, x))

8.6 Comments on k-NN


Positive
• Easy to implement
• Good ”baseline” algorithm / experimental control
• Incremental learning easy
• Psychologically plausible model of human memory
Negative
• Led astray by irrelevant features
• No insight into domain (no explicit model)
• Choice of distance function is problematic
• Doesn’t exploit/notice structure in examples

8.7 Decision Boundaries


Basically the learner tries to make a partition (with as few divisions as possible) of the instance space that indicates, for each partition, to what class it belongs.

8.8 Sequential Covering Approaches


Also known as the "Separate and Conquer" approach. General principle: learn a rule set one rule at a time. It tries
to learn one rule that has high accuracy (when it predicts something, it should be correct) and any coverage (it does
not need to make a prediction for all examples, just for some of them). Then mark the covered examples (these have been
taken care of; now focus on the rest). Repeat until all examples are covered.

8.8.1 Candidate Literals
There are two separate methods to determining candidate literals for these algorithms.

Top-Down Learn One Rule


For this algorithm, we simply go through all of the possible combinations of categories and their values, i.e. (wind =
weak), (wind = strong), (temp = mild), (temp = cool), (humidity = normal), (humidity = high) are all the possible
candidate literals for the above algorithm from the example in the homework assignment.

Top-down Example-driven Learn One Rule


For this algorithm, we want to find the literals that have the highest accuracy. First, we select an arbitrary example
e (usually starting with e1) and we find out which literal value has the highest accuracy. For example, if (humidity
= normal) has an accuracy of 3/4, we take that as our first literal. However, because the accuracy is not 100%, we must
find a second literal such that (hum = norm) AND (literal #2) has 100% accuracy.
In the example in the homework, (temp = mild) has 2/3 positive cases, of which 2/2 are covered when in
conjunction with (hum = norm). Therefore, the first rule is:
1. IF (hum = norm) AND (temp = mild)
which covers e1 and e2. However, this does not cover all positive cases: there still exists a third positive
example (e3). When (wind = weak) there are 2/2 positive examples, and (wind = weak)
covers the remaining positive example (e3), which leads to the conclusion that the second rule is:
2. IF (wind = weak)
and we are done.

8.8.2 Sequential covering


function LearnRuleSet(Target, Attrs, Examples, Threshold):
  LearnedRules := ∅
  Rule := LearnOneRule(Target, Attrs, Examples)
  while performance(Rule, Examples) > Threshold, do
    LearnedRules := LearnedRules ∪ {Rule}
    Examples := {Examples} \ {examples classified correctly by Rule}
    Rule := LearnOneRule(Target, Attrs, Examples)
  Optional: Sort learned rules according to performance
  return LearnedRules

Learning One Rule


• Perform greedy search

• Could be top-down or bottom-up


– Top-down:
∗ Start with the maximally general rule (has maximal coverage but low accuracy)
∗ Add literals one by one
∗ Gradually maximize accuracy without sacrificing coverage (using some heuristic)
Top down has typically more general rules
– Bottom-up:
∗ Start with maximally specific rule (has minimal coverage but maximal accuracy)
∗ Remove literals one by one
∗ Gradually maximize coverage without sacrificing accuracy (using some heuristic)
Bottom up has typically more specific rules

8.8.3 Heuristics
When is rule considered a good rule?
• High accuracy

• High coverage (less important than accuracy)


Possible evaluation functions:
• Accuracy: p / (p + n), where p = # positives and n = # negatives covered by the rule
• Variant on accuracy, the m-estimate: (p + mq) / (p + n + m), a weighted mean between the accuracy on the covered set of examples and
an a priori estimate of the true accuracy q (m is the weight).

• Entropy: more symmetry between positive and negative

8.9 Example-driven Top-down Rule induction


Idea: for a given class c:
As long as there are uncovered examples for C
• pick one such example e
• consider He = rules that cover this example
• search top-down in He to find best rule

Much more efficient search (H_e is much smaller than H, the set of all rules).
Less robust with respect to noise; a noisy example may require a restart.

8.10 Avoiding over-fitting


Post-pruning:
1. Split instances into Growing Set and Pruning Set
2. Learn set SR of rules using Growing Set

3. Find the best simplification BSR of SR


4. while (Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set)) do
(a) SR = BSR
(b) Find the best simplification BSR of SR

5. return BSR

9 Lecture 9: Clustering
9.1 Unsupervised Learning
The data just contains x; there is no given classification or other information. The main goal is to find structure in the
data. The definition of ground truth is often missing (there is no clear error function like in supervised learning).

9.2 Clustering
Problem definition:
Let X = (x_1, x_2, ..., x_d) be a d-dimensional feature vector.
Let D be a set of such vectors, D = {X_1, X_2, ..., X_N}. Given data D, group the N vectors into K groups such that the
grouping is optimal.
Clustering is used for:
• Establish prototypes or detect outliers

• Simplify data for further analysis/learning


• Visualize data
• Preprocessing step for algorithms

• stand alone tool to get insight into data distribution


A good clustering method will produce clusters with
• High intra-class similarity
• Low inter-class similarity

• precise definition of clustering quality is difficult (application-dependent and ultimately subjective)

9.3 Similarity Measures


Possible options

• Distance Metric (Ln metric, ...)


• More general forms of similarity (Do not necessarily satisfy triangle inequality, symmetry, ...)

9.4 Flat vs. Hierarchical Clustering


Flat clustering: given a data set, return a partition. Hierarchical clustering:
• Combine clusters into larger clusters, etc. until 1 cluster = full data set
• Gives rise to cluster hierarchy or taxonomy (taxonomy = grouping of classes; e.g. mammals - Felines - Tigers
etc.)

9.5 Extensional vs Intensional Clustering


Extensional clustering: clusters are defined as sets of examples. Intensional clustering: clusters are described in
some language. Typical criteria for a good intensional clustering:

• High intra cluster similarity


• Simple conceptual description of clusters.

9.6 Cluster Assignment
• Hard clustering: Each item is a member of one cluster
• Soft Clustering: Each item has a probability of membership in each cluster
• Disjunctive clustering: An item belongs to only one cluster
• Non-disjunctive (overlapping) clustering: An item can be in more than one cluster
• Exhaustive clustering: Each item is a member of a cluster
• Partial Clustering: Some items do not belong to a cluster (in practice this is equal to exhaustive clustering
with singleton clusters)

9.7 Major Clustering Approaches


• Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
• Partitioning: Construct various partitions and then evaluate them by some criterion
• Model-based: Hypothesize a model for each cluster and find the best fit of the models to the data
• Density based: Guided by connectivity and density functions

9.8 Hierarchical Clustering


Can do top-down (divisive) or bottom-up (agglomerative). In either case we maintain a matrix of distance (or
similarity) scores for all pairs of instances, clusters (formed so far) or both.

9.8.1 Dendrogram
Tree view of hierarchical clusters; the higher the top bar (horizontal line), the higher the degree of difference within the merged
cluster.

9.8.2 Bottom up Hierarchical Clustering


Given: instances x_1, ..., x_n
for (i = 1 to n): c_i = {x_i}
C = {c_1, ..., c_n}
j = n
while size of C > 1:
  j = j + 1
  (c_a, c_b) = argmin_{u,v} dist(c_u, c_v)
  c_j = c_a ∪ c_b
  add node to tree joining a and b
  C = C \ {c_a, c_b} ∪ {c_j}
Return tree with root node j

9.9 Distance between two clusters


The distance between two clusters can be determined in several ways
• Single link: Distance of two most similar instances: dist(cu , cv ) = min{dist(a, b) | a ∈ cu , b ∈ cv }
• Complete link: distance of 2 least similar instances: dist(cu , cv ) = max{dist(a, b) | a ∈ cu , b ∈ cv }
• Average link: average distance between instances: dist(cu , cv ) = avg{dist(a, b) | a ∈ cu , b ∈ cv }
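A small sketch of these three inter-cluster distances (clusters given as lists of instances, dist a pairwise distance function):

def single_link(cu, cv, dist):
    return min(dist(a, b) for a in cu for b in cv)

def complete_link(cu, cv, dist):
    return max(dist(a, b) for a in cu for b in cv)

def average_link(cu, cv, dist):
    return sum(dist(a, b) for a in cu for b in cv) / (len(cu) * len(cv))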
Computational complexity: Naive implementation has O(n3 ) time complexity, where n is the number of instances.
More advanced computations:
• Single link: Can update and pick pair in O(n), which results in O(n2 ) algorithm
• Complete and average link: Can do these steps in O(n log n), which yields an O(n2 logn) algorithm.

10 Lecture 10:
10.1 Reinforcement learning
Reinforcement learning stems from the situation where an agent only receives a reward after a sequence/series of
actions has been performed. It stems from biological and societal systems where an agent is given a reward (e.g.
dopamine) based on previous decisions, instead of being given constant guidance about which decision is correct or
incorrect.
In reinforcement learning, the agent typically does not possess full knowledge of the environment or the result of
each action. More formally:

• Given:
1. a Set of States S (known to the agent only after exploration)
2. a Set of Actions A (per state)
3. a Transition function: s_{t+1} = δ(s_t, a_t) (unknown to the agent), where δ represents the state transition
4. a Reward function: rt = r(st , at ) (unknown to agent)
• Find:

1. Policy π : S → A that outputs an appropriate action a from set A, given the current state s from set S
such that π(st ) = at .

10.2 Optimal Policy


The optimal policy π is found by maximizing the cumulative value/reward:

• V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ≡ Σ_{i=0}^{∞} γ^i r_{t+i}

where gamma 0 ≤ γ ≤ 1 is a ”discount factor” that leads us to prefer either immediate reward or delayed reward
(higher values of γ → later reward preference). Therefore, the optimal policy becomes:

• π* ≡ argmax_π V^π(s), (∀s), where the value function of the optimal policy for state s is written:

• V^{π*}(s) or V*(s)

However, this demonstrates a problem. How can we learn the optimal policy π* : S → A for arbitrary environments?
Since training data of the form ⟨s, a⟩ is not available, π* cannot be learned directly, because the agent can only directly
choose a and not s. This leads us to the concept of Q-Learning.

10.3 Q-learning Algorithm


Q-learning does not require a model aka it is model-free. It is also exploration-independent (off-policy).

10.3.1 Q-Learning Intuition


We want to maximize the sum of the rewards, doing maximization iteratively while exploring state-action pairs (s,
a) to explore the cumulative reward:

• π*(s) ≡ argmax_a [r(s, a) + γ V*(δ(s, a))]

The problem with this is that the agent typically does not have perfect knowledge of δ (the state transitions) or
r (the reward in all states). This means that the agent cannot predict the reward and the immediate successor state,
so V* cannot be learned directly. Solution: learn the Q-values instead, by computing the optimal
Q-values for all state-action pairs using the Bellman equation:

• Q(s, a) ← r + γ max_{a'} Q(s', a'), so the optimal policy becomes:

• π*(s) ≡ argmax_a Q(s, a)

10.3.2 Learning the Q-Values
We use iterative approximation to learn the Q-values for a given state-action pair:
• V*(s) = max_{a'} Q(s, a')
So that we can rewrite:
• Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
And we then obtain the recursive update rule that allows an iterative approximation of Q:
• Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
This way, the agent stores the value Q̂(s, a) in a large look-up table. Then the agent repeatedly observes its own
current state s, chooses some action a, and observes the resulting reward r(s, a) and the new state s0 = δ(s, a). This
way, the agent repeatedly samples from unknown functions δ(s, a) and r(s, a) without having full knowledge of these
functions.
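A sketch of one tabular Q-learning step, with Q as a look-up table (a dict keyed by (state, action) pairs); the learning rate alpha is a generalization, and with alpha = 1 it reduces to the deterministic update rule above:

def q_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=1.0):
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # actions valid in s_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)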

10.3.3 Q-Learning Optimality


In deterministic environments (if the next state is perfectly predictable given knowledge of the previous state and the
agent’s action), Q-Learning is guaranteed to converge for infinite amounts of updates of each state-action
pair. In practice, infinite amounts of updates are not required to determine the optimal policy.

10.3.4 Accelerating the Q-Learning Process


One way to accelerate this process is to back-propagate the Q-values after a visit of a sequence of states. For this
you have to remember previously visited states within one run.
In this case, do we always choose the next action that maximizes Q̂(s, a)? NO, because this risks the situation where no
new values are learned and it can become biased by initial random exploration, meaning that Q-Learning would
not converge.
The better choice is to balance exploration with the exploitation of known Q-values. A probabilistic model:
• P(a_i | s) = k^{Q̂(s, a_i)} / Σ_j k^{Q̂(s, a_j)}

Actions with higher Q̂(s, a) are more likely to be picked compared to other actions. High k = higher exploitation
factor, lower k = higher exploration factor.

10.3.5 Q-Learning Summary


• Q-Learning is model-free: Q-Learning does not need any information about the environment except for the set
of valid actions for each state.
• Given a chosen state-action pair, the environment will provide the rewards.
• Once these are given, a reinforcement learning technique such as Q-Learning explores the environment and the
connected reward autonomously and thus performs autonomous learning of the optimal policy.
• Q-Learning is only guaranteed to converge given infinitely many iterations; in practice, however, it converges in a reasonable
number of iterations.

10.4 Online Learning and SARSA


An off-policy learner learns the value of the optimal policy independently of the agent’s actions. An on-policy
learner learns the value of the policy being carried out by the agent, including the exploration steps.
Limitation of off-policy learning: There may be cases where ignoring what the agent actually does is dangerous (there
will be large negative rewards).
SARSA updates using the action that was actually chosen by the agent (rather than the best possible action argmax_a Q(s, a)).
• Can take exploration into account
• Online and continuous learning

10.5 Expectation Maximization
Given a statistical model which generates a set X of observed data, a set of unobserved latent data or missing
values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the
maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the
observed data:

• L(θ; X) = p(X | θ) = ∫ p(X, Z | θ) dZ

However, this quantity is often intractable (e.g. if z is a sequence of events, so that the number of values grows
exponentially with the sequence length, making the exact calculation of the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

• Expectation step (E step):


– Calculate the expected value of the log likelihood function, with respect to the conditional distribution of
Z given X under the current estimate of the parameters θ^(t):
– Q(θ | θ^(t)) = E_{Z|X,θ^(t)} [log L(θ; X, Z)]
• Maximization step (M step): Find the parameters that maximize this quantity:

– θ^(t+1) = argmax_θ Q(θ | θ^(t))

The typical models to which EM is applied uses Z as a latent variable indicating membership in one of a set of
groups:
The observed data points x may be discrete (taking values in a finite or countably infinite set) or continuous
(taking values in an uncountably infinite set). Associated with each data point may be a vector of observations. The
missing values (aka latent variables) Z are discrete, drawn from a fixed number of values, and with one latent variable
per observed unit. The parameters are continuous, and are of two kinds: Parameters that are associated with all
data points, and those associated with a specific value of a latent variable (i.e., associated with all data points which
corresponding latent variable has that value). However, it is possible to apply EM to other sorts of models.

