2017 Machine Learning Summary v4 PDF
Contents
1 Lecture 1: Version Spaces
1.1 Classification Task
1.2 Learning Classifiers
1.3 Conjunction of Discrete Attributes
1.4 Find-S Algorithm
1.5 Version Spaces
1.6 List Elimination Algorithm
1.7 Boundary Sets
1.8 Candidate Elimination Algorithm
1.8.1 Picking Training Instances
1.8.2 Unanimous-Voting Rule
1.8.3 Inductive Bias
1.8.4 Unanimous Voting
1.8.5 Accuracy
1.9 Volume Extension Approach
1.9.1 In Practice
1.10 K-Version Spaces
3 Lecture 3: Evaluation of Learning Models
3.1 Motivation
3.2 Evaluation of Classifiers' Performance
3.2.1 Confusion Matrix
3.2.2 Metrics
3.2.3 Confidence Intervals for Estimates of Classification Performance
3.2.4 Metric Evaluation TL;DR
3.3 Comparing Data-Mining Classifiers
3.3.1 Counting the Costs
3.3.2 Cost-Sensitive Classification
3.4 Lift Charts
3.4.1 Generating a Lift Chart
3.5 ROC Curves
3.5.1 ROC Convex Hull
3.5.2 Iso-Accuracy Lines
3.5.3 Constructing a ROC Curve for 1 Classifier
3.5.4 Area Under Curve Metric (AUC)
7 Lecture 7: Recommender Systems
7.1 Collaborative Filtering
7.2 Content-Based Approach
7.3 Collaborative Filtering
7.3.1 Collaborative Filtering Algorithm
7.3.2 Mean Normalization
7.4 Support Vector Machines
7.4.1 Linear SVMs
7.4.2 Non-Linear SVMs
7.4.3 Logistic Regression to SVM
7.4.4 Kernels
7.4.5 Cost Function
7.5 Compare SVM
8 Lecture 8:
8.1 Nearest Neighbor Algorithm
8.1.1 Properties
8.1.2 Decision Boundaries
8.1.3 Lazy vs Eager Learning
8.1.4 Inductive vs Transductive Learning
8.1.5 Semi-Supervised Learning
8.1.6 Distance Definition
8.1.7 Normalization of Attributes
8.1.8 Weighted Distances
8.1.9 More Distances
8.2 Distance-Weighted kNN
8.2.1 Edited k-Nearest Neighbor
8.3 Pipeline Filters
8.4 kD-trees
8.5 Local Learning
8.6 Comments on k-NN
8.7 Decision Boundaries
8.8 Sequential Covering Approaches
8.8.1 Candidate Literals
8.8.2 Sequential Covering
8.8.3 Heuristics
8.9 Example-Driven Top-Down Rule Induction
8.10 Avoiding Over-fitting
9 Lecture 9: Clustering
9.1 Unsupervised Learning
9.2 Clustering
9.3 Similarity Measures
9.4 Flat vs. Hierarchical Clustering
9.5 Extensional vs Intensional Clustering
9.6 Cluster Assignment
9.7 Major Clustering Approaches
9.8 Hierarchical Clustering
9.8.1 Dendrogram
9.8.2 Bottom-Up Hierarchical Clustering
9.9 Distance Between Two Clusters
10 Lecture 10:
10.1 Reinforcement Learning
10.2 Optimal Policy
10.3 Q-Learning Algorithm
10.3.1 Q-Learning Intuition
10.3.2 Learning the Q-Values
10.3.3 Q-Learning Optimality
10.3.4 Accelerating the Q-Learning Process
10.3.5 Q-Learning Summary
10.4 Online Learning and SARSA
10.5 Expectation Maximization
1 Lecture 1: Version Spaces
Version space learning is a logical approach to machine learning, specifically binary classification. Version space
learning algorithms search a predefined space of hypotheses, viewed as a set of logical sentences. Formally, the
hypothesis space is a disjunction:
• H1 ∨ H2 ∨ ... ∨ Hn
(i.e., at least one of hypotheses 1 through n holds). A version space learning algorithm is presented with examples, which it uses to restrict its hypothesis space: for each example x, the hypotheses that are inconsistent with x are removed from the space. This iterative refinement of the hypothesis space is called the candidate elimination algorithm (see 1.8); the hypothesis space maintained inside the algorithm is called its version space.
Overview
• Classification Task
• FindS algorithm
• Version Spaces
• List Elimination Algorithm
• Boundary Sets and Candidate Elimination Algorithm
• Properties of Version Spaces
• Inductive Bias
• Version Spaces and Consistency Tests
• Volume Extension and k-Version Spaces
• A classifier maps an object to a class label, i.e. it indicates whether an object belongs to a certain class.
• The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be
returned by it (as being true).
So a classification task consists of four components: X, Y, H and D, where X := the instance space (the set of objects to classify), Y := the set of class labels that can be assigned to an object (the assignment is done by a hypothesis from H), H := the hypothesis space, and D := the training data.
A binary classification task has |Y| = 2; a multi-class classification task has |Y| > 2.
1.3 Conjunction of Discrete Attributes
How do we generalize a hypothesis h with respect to an instance x?
For every attribute Ai that is specified in h and contradicts the instance x: set Ai of h to ? (unspecified).
And how do we make a hypothesis more specific? We build a set of minimal specializations, assuming the instance x that must be excluded is a negative object:
For every attribute Ai of h that is not specified (= ?), and for every value v that Ai can take and that differs from the value in x, we create a specialization s that is equal to h except that attribute Ai of s is set to v, and add s to the set of specializations. (end for)
A minimal sketch of both operations follows below.
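The following is a minimal sketch (not code from the lecture) of these two operations for conjunctive hypotheses over discrete attributes, represented as tuples in which "?" means "any value"; the attribute domains passed to specializations are an assumption of this example.

# Minimal sketch (not from the lecture): conjunctive hypotheses over discrete
# attributes, represented as tuples where "?" means "any value".
def generalize(h, x):
    """Minimally generalize hypothesis h so that it covers the positive instance x."""
    return tuple("?" if hv != "?" and hv != xv else hv for hv, xv in zip(h, x))

def specializations(h, x, domains):
    """Minimal specializations of h that exclude the negative instance x.

    domains[i] is the set of values attribute i can take (an assumption of
    this sketch, not notation from the summary)."""
    result = []
    for i, hv in enumerate(h):
        if hv == "?":
            for v in domains[i]:
                if v != x[i]:                      # must no longer cover x
                    s = list(h)
                    s[i] = v
                    result.append(tuple(s))
    return result

# Example: weather instances (Outlook, Temperature, Wind)
h = ("Sunny", "?", "?")
print(generalize(h, ("Rainy", "Mild", "Weak")))    # ('?', '?', '?')
domains = [{"Sunny", "Rainy"}, {"Mild", "Hot"}, {"Weak", "Strong"}]
print(specializations(("?", "?", "?"), ("Rainy", "Hot", "Weak"), domains))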
1.8 Candidate Elimination Algorithm
• Find:
– the general boundary G of H consistent with E
– the specific boundary S of H consistent with E
• Local variables:
– G: set of hypotheses in H
– S: set of hypotheses in H
• Let G = {true}, S = {false};
1. for each e ∈ E do:
(a) if (e is a positive example) then compare e to the boundary sets of the previous step:
i. Elements of G that classify e as negative are removed from G;
ii. Each element s of S that classifies e as negative is replaced by its minimal generalizations that classify e as positive (for conjunctions: every specified attribute of s that contradicts e is set to ?);
iii. Non-maximal hypotheses are removed from S;
(b) else if (e is a negative example) then compare e to the boundary sets of the previous step:
i. Elements of S that classify e as positive are removed from S;
ii. Each element g of G that classifies e as positive is replaced by its minimal specializations that classify e as negative (for conjunctions: for every unspecified attribute of g, create one new hypothesis per attribute value that differs from the value in e, so each contradiction gets its own hypothesis with ?'s everywhere except that single specified attribute); only specializations that are still more general than some element of S are kept;
iii. Non-minimal hypotheses are removed from G.
• Definition 3: Volume V(VS(D)) of version space VS(D) is the set of all instances that are not classified by
VS(D).
1.8.3 Inductive Bias
Completeness of a hypothesis space: H is complete ↔ for any dataset D there exists a hypothesis h in H such that h is consistent with D.
The inductive bias of version spaces comes from the fact that the hypothesis space is in general incomplete (restricted, e.g., to conjunctions of attribute values): by committing to H, we assume that the target hypothesis can be expressed in H. We speak of a correct inductive bias when the target hypothesis t is indeed in the hypothesis space H and the training data are noise free (all attribute values are known and correct). This matches the common definition: the inductive bias is the assumption that the target concept is contained in the hypothesis space, and it is exactly this assumption that allows the rules induced from the training data to be used to classify new instances.
1.8.5 Accuracy
So when can we reach 100% accuracy and when not? Well there are 3 cases:
• Case 1: Data is noise free and the hypothesis space H contains the target classifier. (100% accuracy)
• Case 2: The hypothesis space H does not contain the target classifier and thus we do not know for sure which
class the instance has.
• Case 3: The training data contains noise. Therefore we cannot be certain if we are classifying correctly.
1.9.1 In Practice
• Case 2: H does not contain the target classifier. The solution in this case is to add a classifier that classifies
the instance differently than the classifiers in VS(D). In other words, we extend the volume of VS(D)
• Case 3: When the datasets are noisy. The solution is again to add a classifier that classifies the instances
differently than the classifiers in VS(D) and we extend the volume of VS(D) again.
2 Lecture 2: Decision Trees
Overview
Decision Trees for Classification
• Definition
• Classification Problems for Decision Trees
• Entropy and Information Gain
So, in short, a decision tree is built by taking a decision attribute (e.g. weather forecast), creating a child node for each of its values, and "listing" at each child node all instances that match that value. If every leaf node is pure (no leaf node contains a true and a false instance at the same time), the tree classifies the training data correctly; otherwise, impure leaves are split further on another attribute.
2.1.3 Entropy
Basically what entropy does is measure the impurity of the training data:
• E(S) = −p+ log2 p+ − p− log2 p−
where S is a sample of the training data, p+ is the proportion of positive training instances in S and p− the proportion of negative ones. This brings us to information gain:
• Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) E(Sv)
where Sv = {s ∈ S | A(s) = v}, i.e. the subset of samples in S for which attribute A takes value v. The tree-building procedure is then:
• Determine the attribute with the highest information gain on the training set.
• Use this attribute as the root, and create a branch for each of the values the attribute can have.
• For each branch, repeat the process with the subset of the training set that falls into that branch.
• We prefer trees with high-information-gain attributes near the root (a sketch of the computation follows below).
Note that this bias is not a restriction on the hypothesis space but a preference for some hypotheses.
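A minimal sketch (not lecture code) of the entropy and information-gain computation above, with instances stored as plain dicts; the toy data at the bottom are made up:

from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_c p_c * log2(p_c) over the class proportions in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

def information_gain(instances, labels, attribute):
    """Gain(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v)."""
    total = len(labels)
    remainder = 0.0
    for value in {x[attribute] for x in instances}:
        subset = [y for x, y in zip(instances, labels) if x[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Tiny example (hypothetical data, not from the summary):
X = [{"Outlook": "Sunny"}, {"Outlook": "Sunny"}, {"Outlook": "Rain"}, {"Outlook": "Rain"}]
y = ["+", "-", "+", "+"]
print(information_gain(X, y, "Outlook"))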
2.3 Overfitting, Underfitting, and Pruning
Overfitting is the phenomenon where a model contains more parameters than the data can reasonably justify, or in simpler terms: the model is learning too much from noise and interpreting noise as actually meaningful data. Overfit statistical models can therefore suggest things that aren't true, because they have learned too much from noise.
Overfitting generally happens when a model has more adjustable parameters than would be optimal, or, more simply, when it is more complicated than necessary. Such a model may "learn" a specific example from the noise and assume it is an important characteristic, when in fact it was merely an outlier. Overfitting can be avoided by keeping the model as general as possible, and furthermore by finding some form of compromise between an overfit and an underfit model.
In science the principle of Occam’s Razor is the concept that the simplest solution is often the best or ”most
correct.” Essentially: ”Do not make things more complicated than necessary”. This view is also often used in machine
learning. When working with decision trees this holds as well: big (complex) decision trees harbor the threat of over-fitting. The bigger the tree, the bigger the risk of over-fitting.
2.3.3 Underfitting
Underfitting occurs when a statistical model or machine learning algorithm cannot adequately capture the underlying
structure of the data. It occurs when the model or algorithm does not fit the data enough. Underfitting occurs if the
model or algorithm shows low variance but high bias (to contrast the opposite, overfitting from high variance and
low bias). It is often a result of an excessively simple model.
• Underfitness:
– When performance is poor (error is high, accuracy is low) on both the training AND unseen/testing data.
The model is too generic, and it is not learning enough, leading to poor performance all around. Can be
identified on a graph by seeing low accuracy rates for both sets of data.
• Optimality:
– When performance on the training data and on the unseen/testing data follows a very similar pattern, the model is fitting about as well as it can; anything that still affects the training data also affects the unseen data, leading to the conclusion that something other than model fitness is at play.
1. As the validation set grows, the growing set shrinks, and vice-versa.
2. If the validation set is too small, it can make extremely general inferences on the data it contains, which it
then uses to inform the decision tree which can lead to an overly-pruned and too-small decision tree.
3. If the validation set is too large, it can lead to an under-pruned and too-large decision tree, leading to inefficiency
when making the decision tree.
4. The size of the validation set is subjective relative to the data, and is often best ”played around with” in order
to generate the most efficient results, measured by other metrics such as relative error rates.
We do the above until further pruning is harmful: Evaluate impact on validation set for each node that can be
pruned and remove the sub-tree that most improves validation set accuracy.
• Sub-tree raising
1. Remove the sub-tree that has the parent of node d as root.
2. Place d at the place of its parent
3. Sort the training instances associated with the parent of d using the sub-tree with root d.
Then again evaluate if the accuracy of the tree on the validation set has increased.
2.3.7 Rule Post-Pruning
1. Convert tree to equivalent set of rules.
2. Prune each rule independently of others.
3. Sort final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent
instances.
So for converting into rules we do the following: Start at the root node; for every path to a leaf node we create a
rule using AND operators. Then for every rule try to prune it independently (see if you can achieve higher accuracy
by removing conditions in the rule).
2.3.8 Impurity
Impurity: the diversity of the training instances. High impurity means that there is an equal number of instances of every class; low impurity means that all instances are of the same class. More formally we can describe impurity as follows: let S be a sample of training instances and pj the proportion of instances of class j (j = 1, ..., J) in S. An impurity measure I(S) must satisfy the following:
• I(S) is minimum only when pi = 1 and pj = 0 for j ≠ i (all objects are of the same class)
• I(S) is maximum only when pj = 1/J for all j (there is exactly the same number of objects of every class)
• I(S) is symmetric with respect to p1, ..., pJ
2.6 Attributes with Many Values
If attributes have a lot of values this poses 2 problems:
1. No good splits: they fragment the data too quickly, leaving insufficient data at the next level.
• If node n tests the attribute A, assign most common value of A among other instances sorted to node n.
• If node n tests the attribute A, assign a probability to each of the possible values of A. These probabilities are estimated from the observed frequencies of the values of A among the instances sorted to node n. These probabilities are then used in the information gain measure (via the weighted term Σ_{v ∈ Values(A)} (|Sv| / |S|) E(Sv)).
2.8 Windowing
Lastly, if we don't have enough memory to fit all the training data, we can use a technique named windowing:
1. Select randomly n instances from the training data D and put them in the window set W.
2. Train a decision tree DT on W.
3. Classify the remaining instances in D with DT and add (some of) the misclassified instances to W.
4. Repeat from step 2 until DT classifies the remaining instances sufficiently well.
3 Lecture 3: Evaluation of Learning Models
Overview
• Motivation
• Metrics for Classifier’s Evaluation
• Methods for Classifier’s Evaluation
• Comparing Data Mining Schemes
• Costs in Data Mining
– Cost-Sensitive Classification and Learning
– Lift Charts
– ROC Curves
3.1 Motivation
Why evaluate a classifier's generalization performance (how good is the classifier in practice)?
• To determine whether to employ the classifier, i.e. when training on a limited data set we need to know how accurate the classifier is in order to decide whether it can be deployed.
• For optimization purposes, e.g. when post-pruning, the accuracy must be determined at every pruning step.
3.2.2 Metrics
There are various metrics to evaluate a classifier:
• Accuracy = (TP + TN) / (P + N) = ratio of correctly classified instances
• Error = (FP + FN) / (P + N) = ratio of incorrectly classified instances
• Precision = TP / (TP + FP) = ratio of instances classified as positive that really are positive
• Recall / TP rate (TPR) = TP / P = ratio of positive instances that are correctly classified
• FP rate (FPR) = FP / N = ratio of negative instances that are incorrectly classified as positive
A small sketch of how these follow from the confusion-matrix counts is given below.
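A minimal sketch (not lecture code) of these metrics computed from the four confusion-matrix counts; the example numbers are made up:

def metrics(tp, fp, tn, fn):
    p, n = tp + fn, tn + fp            # actual positives / negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,           # TP rate
        "fp_rate":   fp / n,
    }

print(metrics(tp=40, fp=10, tn=45, fn=5))
# {'accuracy': 0.85, 'error': 0.15, 'precision': 0.8, 'recall': ~0.889, 'fp_rate': ~0.182}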
So to which data can we apply these metrics? Before we start we need to define stratification:
when stratifying the data, make sure that each class is represented in every subset with approximately the same proportion as in the full data set. This is a more advanced version of balancing the data.
• Training data (not a good choice, because performance on the training data is not a good indicator of performance on future data)
• Independent test data (requires plenty of data and a natural way of forming training and test data)
• Hold-out method (the data is split into training and test data, usually 2/3 and 1/3 respectively. However, if the data is unbalanced the samples may not be representative, e.g. a sample may contain few or no instances of a certain class)
• Repeated hold-out method (More reliable than regular hold-out method due to the fact that it repeats the
process with randomly selected different sub-samples possibly with stratification. But this method does not
avoid overlapping test data nor does it guarantee that all instances are used at least once)
• k-fold cross-validation method (Split data into k equally sized stratified subsets then each subset is used for
testing and the remainder for training. The metric estimates are averaged to yield an overall estimate. Standard
method = 10-fold stratified cross-validation. 10-fold gives best results, stratification reduces estimate’s variance.
Further improvement: Repeated 10-fold stratified cross-validation reduces the estimate’s variance even further)
• Leave-one-out cross-validation (number of folds = number of training instances. Makes best use of the data
BUT computationally expensive. Involves no random sub-sampling. Does not allow stratification. Worst case
scenario: data set split equally into 2 classes: 50% accurate on fresh data but estimated error is 100%)
• Bootstrap method, aka the 0.632 bootstrap (like cross-validation, but with replacement. Idea: draw n instances from a dataset of size n with replacement to create the training set. Instances from the original dataset that don't occur in the new training set are used for testing. The probability of an instance ending up only in the test data is ≈ e−1 = 0.368, i.e. test data ≈ 36.8% of instances ⇔ training data ≈ 63.2%. This requires a special error estimate: error = 0.632 · e_test instances + 0.368 · e_training instances, where e_x is the error measured on subset x. Repeat the process several times with different replacement samples and average the results.)
• And many more
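As an illustration of the k-fold cross-validation method listed above, here is a minimal sketch (not lecture code); the callables train and accuracy stand for whatever learner and metric are being evaluated and are assumptions of this example. Stratification (splitting each class separately so that folds keep the class proportions) is omitted for brevity.

import random

def k_fold_cv(X, y, k, train, accuracy, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal folds
    scores = []
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in idx if j not in test_idx]
        model = train([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(accuracy(model, [X[j] for j in folds[i]], [y[j] for j in folds[i]]))
    return sum(scores) / k                          # averaged overall estimate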
• Re-sampling of instances according to costs
• Weighting of instances according to costs.
4 Lecture 4: Bayesian Learning
4.1 Introduction
• Each observed training instance can incrementally decrease or increase the estimated probability that a hy-
pothesis is correct.
• Prior knowledge is combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods accommodate hypotheses that make probabilistic predictions (e.g. "93% chance of recovery")
• Instances are classified by combining predictions of multiple hypotheses, weighted by their probabilities.
• Requires initial knowledge of many probabilities.
• High computational cost.
• Is a standard for optimal learning.
4.7 Bayes Optimal Classifier
Another problem is the following: Given data D, hypothesis space H, and a new instance x, what is the most probable
classification of x? It is not necessarily the classification given by the most probable hypothesis in H. The Bayes optimal classifier assigns to an instance
the classification cj that has the maximum posterior probability P (cj | D). Now the maximum posterior probability
P (cj | D) is calculated using the theorem for total probability. It is calculated using all the hypotheses weighted by
their posterior probabilities w.r.t. the data D:
vOB = arg max_{cj ∈ {+,−}} P(cj | D) = arg max_{cj ∈ {+,−}} Σ_{hi ∈ H} P(cj | hi) P(hi | D)
This is the best classification method with respect to average accuracy; however, the Bayes optimal classification may not correspond to any single hypothesis in the hypothesis space!
• vMAP = arg max_{vj} P(vj) ∏_i P(ai | vj)
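This rule (pick the class that maximizes the prior times the product of per-attribute likelihoods, i.e. the naive Bayes classification rule) can be sketched as follows; this is not lecture code and the toy data at the bottom are made up:

from collections import Counter, defaultdict

def train_counts(X, y):
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)          # (attribute index, class) -> value counts
    for x, v in zip(X, y):
        for i, a in enumerate(x):
            value_counts[(i, v)][a] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for v, cv in class_counts.items():
        score = cv / n                                  # P(v)
        for i, a in enumerate(x):
            score *= value_counts[(i, v)][a] / cv       # P(a_i | v), no smoothing
        if score > best_score:
            best, best_score = v, score
    return best

X = [("Sunny", "Hot"), ("Sunny", "Mild"), ("Rain", "Mild"), ("Rain", "Hot")]
y = ["-", "+", "+", "-"]
print(classify(("Sunny", "Mild"), *train_counts(X, y)))   # '+'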
5 Lecture 5: Linear Regression
TL;DR
Linear regression is the task of fitting a function that predicts a target value y from an input vector x, based on example pairs (x, y), so that the fitted function describes the pattern in the data as well as possible. We usually do this by minimizing the least-squares error between the data points and the fitted function, or by minimizing a penalized (regularized) version of the least-squares loss function.
• Hypothesis:
– hΘ (x) = Θ0 + Θ1 x
• Parameters:
– Θ0 , Θ1
• Cost Function J(Θ0, Θ1):
– J(Θ0, Θ1) = (1 / 2m) Σ_{i=1}^{m} (hΘ(x(i)) − y(i))², where
– hΘ(x(i)) − y(i) is the difference between the predicted value and the actual value for training example i.
To find the optimal values for the parameters Θ0 and Θ1 we want to minimize this difference between the predicted results and the actual results over our training data.
We attach the coefficient 1/2 so that the factor 2 produced by differentiating the square cancels out. We also divide by the number of summands m to get the average cost per data point.
The error measure in the cost function is a ”statistical distance”; in contrast to the popular and preliminary
understanding of distance between two vectors in Euclidean space. With statistical distance we are attempting to
map the ”dis-similarity” between estimated model and optimal model to Euclidean space.
There is no constricting rule regarding the formulation of this statistical distance, but if the choice is appropriate
then a progressive reduction in this ’distance’ during optimization translates to a progressively improving model
estimation. Consequently, the choice of ’statistical distance’ or error measure is related to the underlying data
distribution.
5.3.1 Least Squares Error
Given a collection of data points (xi, yi), once you have your hypothesis hΘ for some Θ, the least-squares error of hΘ on a single data point (xi, yi) is:
• (hΘ(xi) − yi)²
If we sum up the errors over all data points, we multiply by 1/2 so that the factor 2 produced by differentiating the square cancels out, resulting in the total error:
• (1/2) Σ_{i=1}^{m} (hΘ(x(i)) − y(i))²
We also divide the total error by the number of summands m to get the average error per data point, giving us the resulting coefficient of 1/(2m):
• (1/2m) Σ_{i=1}^{m} (hΘ(x(i)) − y(i))²
When comparing performance on two data sets of different size, the raw sums of squared errors are not directly comparable, because larger data sets tend to produce larger error totals. After normalizing by m, you can compare the average error per data point.
The gradient descent update rule for minimizing this cost is then (a minimal code sketch follows below):
• Θj := Θj − α (1/m) Σ_{i=1}^{m} (hΘ(x(i)) − y(i)) xj(i)
where m is the number of data points and α is the learning rate (usually pre-defined).
NOTE: Simultaneously update every Θj! Only after updating ALL Θ's should you recompute hΘ(x)!
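A minimal sketch (not lecture code) of batch gradient descent for single-feature linear regression, following the update rule above; the toy data are made up:

def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

# Data generated from y = 2x + 1, so the fit should come out close to (1, 2).
print(gradient_descent([0, 1, 2, 3, 4], [1, 3, 5, 7, 9]))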
5.5 Normal Equation
5.5.1 Feature Scaling
With feature scaling we bring all features into the [−1, 1] range. Basically, we standardize the range of the independent variables (features) of the data, because scaling ensures that features with large values do not dominate as the main predictor. This can improve the performance of the gradient descent algorithm. The normal equation, Θ = (XᵀX)⁻¹ Xᵀ y, by contrast, solves for Θ directly and does not need feature scaling.
5.7.1 Regularization
When applying regularization we alter the cost function into the following:
• J(Θ) = (1/2m) [ Σ_{i=1}^{m} (hΘ(x(i)) − y(i))² + λ Σ_{j=1}^{n} Θj² ], where the regularization term we add is λ Σ_{j=1}^{n} Θj²
Regularization parameter λ is an input parameter to the model. Lambda can be selected by sub-sampling the
data and finding the variation. The value of lambda can reduce overfitting as it increases, however it does
this at the expense of greater bias.
For the gradient descent algorithm it would look as follows:
• Θj := Θj − α [ (1/m) Σ_{i=1}^{m} (hΘ(x(i)) − y(i)) xj(i) + (λ/m) Θj ], which can be rewritten as
• Θj := Θj (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (hΘ(x(i)) − y(i)) xj(i)
Regularization also:
1. Fights over-fitting
2. Guarantees that the matrix inverted in the (regularized) normal equation is of full rank, and thus invertible (a minimal sketch of that equation follows below)
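A minimal sketch (not lecture code) of the regularized normal equation Θ = (XᵀX + λM)⁻¹ Xᵀ y, where M is an identity matrix with a zero in the bias position so the intercept is not penalized; the data values are made up:

import numpy as np

def ridge_normal_equation(X, y, lam):
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])        # add bias column
    M = np.eye(n + 1)
    M[0, 0] = 0.0                               # do not regularize the intercept
    return np.linalg.solve(Xb.T @ Xb + lam * M, Xb.T @ y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])
print(ridge_normal_equation(X, y, lam=0.1))     # approx [intercept, slope]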
6 Lecture 6: Logistic Regression and Artificial Neural Networks
6.1 Logistic Regression
We can cast a binary classification problem into a continuous regression problem. However we can not simply use the
linear regression that we mentioned before. Logistic regression is used when the variable y that we want to predict
can only take on discrete values (i.e. Classification). Considering a binary classification problem (y = 0 or y = 1),
the hypothesis function could be defined so that it is bounded between [0, 1] in which we use some form of logistic
function, such as the Sigmoid Function. Other, more efficient functions exist, such as the ReLU (Rectified Linear Unit); however, they are not covered in this course, as the sigmoid function is the historical standard.
The decision boundary for the logistic sigmoid function is where hΘ (x) = 0.5 (values less than 0.5 means false,
values equal to or more than 0.5 means true). Another interesting property is that it also gives a chance of the instance
being of that class e.g. hΘ (x) = 0.7 means that there is a 70% chance that the instance is of the corresponding class,
so we get:
• hΘ(x) = g(Θ0 + Θ1 x1 + Θ2 x2), and with e.g. Θ0 = −3, Θ1 = Θ2 = 1 we predict y = 1 if:
• −3 + x1 + x2 ≥ 0
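A minimal sketch (not lecture code) of the sigmoid hypothesis and this decision boundary, using the Θ = (−3, 1, 1) example above:

from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def h(theta, x):                      # x = (x1, x2), theta = (theta0, theta1, theta2)
    return sigmoid(theta[0] + theta[1] * x[0] + theta[2] * x[1])

theta = (-3.0, 1.0, 1.0)
print(h(theta, (2.0, 2.0)))           # > 0.5, so predict y = 1 (x1 + x2 >= 3)
print(h(theta, (1.0, 1.0)))           # < 0.5, so predict y = 0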
6.3 Gradient Descent for Logistic Regression
How do we find the right Θ parameter value? We use gradient descent!
• Repeat until convergence:
1. Θj := Θj − α (1/m) Σ_{i=1}^{m} (hΘ(x(i)) − y(i)) xj(i)
so:
• a1^(2) = g(Θ10^(1) x0 + Θ11^(1) x1 + Θ12^(1) x2)
• a2^(2) = g(Θ20^(1) x0 + Θ21^(1) x1 + Θ22^(1) x2)
• hΘ(x) = g(Θ10^(2) a0^(2) + Θ11^(2) a1^(2) + Θ12^(2) a2^(2))
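A minimal sketch (not lecture code) of this forward pass for a network with two inputs plus bias, two hidden units plus bias, and one sigmoid output; the weight values are made up:

from math import exp

def g(z):
    return 1.0 / (1.0 + exp(-z))

def forward(theta1, theta2, x1, x2):
    x = [1.0, x1, x2]                                    # x0 = 1 is the bias input
    a = [1.0] + [g(sum(w * xi for w, xi in zip(row, x))) for row in theta1]
    return g(sum(w * ai for w, ai in zip(theta2, a)))    # h_theta(x)

theta1 = [[0.5, -1.0, 2.0],     # weights into hidden unit a1
          [-0.5, 1.0, 1.0]]     # weights into hidden unit a2
theta2 = [0.2, 1.0, -1.5]       # weights into the output unit
print(forward(theta1, theta2, 0.7, 0.3))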
• All inputs have some effect: decision trees select the most important attributes, whereas an ANN "selects" attributes by giving them higher or lower weights
• Explanatory power of ANNs is limited
– Model represented as weights in network
– No simple explanation why network makes a certain prediction (cf. trees can give a rule that was used)
– Networks can not easily be translated into a symbolic model (tree, ruleset)
7 Lecture 7: Recommender Systems
7.1 Collaborative Filtering
In short what this means is that we look at what other users/customers liked/rated and try to use this information
to recommend other products.
Note: the 2nd formula combines the knowledge from all users!
• So now we can update the gradient descent algorithm for this case:
– Θk(j) := Θk(j) − α ( Σ_{i:r(i,j)=1} ((Θ(j))ᵀ x(i) − y(i,j)) xk(i) ) for k = 0
– Θk(j) := Θk(j) − α ( Σ_{i:r(i,j)=1} ((Θ(j))ᵀ x(i) − y(i,j)) xk(i) + λ Θk(j) ) for k ≠ 0
• min_{x(1),...,x(nm)} (1/2) Σ_{i=1}^{nm} Σ_{j:r(i,j)=1} ((Θ(j))ᵀ x(i) − y(i,j))² + (λ/2) Σ_{i=1}^{nm} Σ_{k=1}^{n} (xk(i))²
7.3.1 Collaborative Filtering Algorithm
1. Initialize the input features x(1), ..., x(nm) and the weights Θ(1), ..., Θ(nu) to small random values.
2. Minimize the cost function J(x(1), ..., x(nm), Θ(1), ..., Θ(nu)) using gradient descent (or another optimization algorithm).
3. For a user with (learned) parameter Θ and a movie with (learned) features x, predict a star rating of ΘT x.
This can be done for similar reasons why we would use logistic regression in other classification cases.
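Step 3 above (predict a star rating of ΘᵀX) combined with the mean normalization of 7.3.2 can be sketched as follows; this is not lecture code and the parameter values are made up:

import numpy as np

def predict_rating(theta_user, x_movie, movie_mean):
    return float(theta_user @ x_movie) + movie_mean     # add the subtracted mean back

theta_user = np.array([0.8, -0.2])     # learned user parameters
x_movie = np.array([1.2, 0.4])         # learned movie features
movie_mean = 3.5                       # average rating of this movie
print(predict_rating(theta_user, x_movie, movie_mean))  # ~4.38 stars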
7.4.4 Kernels
Pick data points in the space (named landmarks). The idea is that by applying a positive or negative weight to the similarity (kernel value) between a new instance and each landmark, we can predict whether or not the new instance belongs to the class:
• predict y = 1 if:
– Θ0 + Θ1 f1 + Θ2 f2 + ... + Θi fi ≥ 0
• given x:
– fi = similarity(x, l(i)) = exp(−‖x − l(i)‖² / (2δ²)), where l(i) is landmark (kernel) i
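A minimal sketch (not lecture code) of this Gaussian similarity feature and the resulting prediction rule; the landmarks, weights and kernel width are made up (sigma plays the role of δ above):

from math import exp, dist

def rbf_feature(x, landmark, sigma=1.0):
    return exp(-dist(x, landmark) ** 2 / (2 * sigma ** 2))

def predict(x, landmarks, theta, sigma=1.0):
    f = [1.0] + [rbf_feature(x, l, sigma) for l in landmarks]   # f0 = 1
    return 1 if sum(t * fi for t, fi in zip(theta, f)) >= 0 else 0

landmarks = [(1.0, 1.0), (4.0, 4.0)]
theta = [-0.5, 2.0, -2.0]          # positive weight near the first landmark
print(predict((1.2, 0.9), landmarks, theta))   # 1: close to landmark 1
print(predict((3.8, 4.1), landmarks, theta))   # 0: close to landmark 2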
7.4.5 Cost Function
Hypothesis: Given x, compute features f ∈ Rm+1 :
• predict ”y=1” if ΘT f ≥ 0
Training:
• min_Θ C Σ_{i=1}^{m} [ y(i) cost1(Θᵀ f(i)) + (1 − y(i)) cost0(Θᵀ f(i)) ] + (1/2) Σ_{j=1}^{n} Θj²
8 Lecture 8:
8.1 Nearest Neighbor Algorithm
Idea: Instances that lie ”close” to each other are most likely similar to each other.
8.1.1 Properties
• Learning is very fast
• No info is lost (brings disadvantage: ”Details” may be noisy)
• Hypothesis space:
– Variable size
– Complexity of the hypothesis rises with the number of stored examples
• computations can take extra info about the needed predictions into account.
• Can use local models that work well in the neighborhood of the target example.
8.1.6 Distance Definition
The representation of the data is critical; it makes or breaks the NN algorithm.
For example, the Manhattan, Euclidean or, more generally, Ln-norm distance for numerical attributes:
• Ln(x1, x2) = ( Σ_{i=1}^{#dim} |x1,i − x2,i|ⁿ )^(1/n)
Hamming distance for nominal attributes:
• d(x, y) = Σ_{i=1}^{n} δ(xi, yi), where δ(xi, yi) = 0 if xi = yi and δ(xi, yi) = 1 if xi ≠ yi
• Sequence Distances:
– Dynamic Time Warping: Sequences are aligned ”one to one” (non linear alignments are possible)
– Dimensionality reduction
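A minimal sketch (not lecture code) of the Ln (Minkowski) distance and the Hamming distance defined above:

def ln_distance(x1, x2, n=2):
    """n = 1 gives Manhattan, n = 2 gives Euclidean distance."""
    return sum(abs(a - b) ** n for a, b in zip(x1, x2)) ** (1.0 / n)

def hamming_distance(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)

print(ln_distance((0, 0), (3, 4)))                         # 5.0 (Euclidean)
print(ln_distance((0, 0), (3, 4), n=1))                    # 7.0 (Manhattan)
print(hamming_distance(("red", "big"), ("red", "small")))  # 1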
8.2 Distance-weighted kNN
Idea: give a higher weight to closer instances; we can then even use all training instances instead of only the k nearest (aka "Shepard's method").
• f̂(xq) = ( Σ_{i=1}^{k} wi f(xi) ) / ( Σ_{i=1}^{k} wi ), with wi = 1 / d(xq, xi)²
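A minimal sketch (not lecture code) of this distance-weighted prediction for a numeric target, using the weights wi = 1/d(xq, xi)²; the toy data are made up:

from math import dist

def weighted_knn_predict(query, examples, k=3):
    """examples: list of (x, f(x)) pairs with numeric vectors x."""
    neighbors = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    weights = []
    for x, fx in neighbors:
        d = dist(query, x)
        if d == 0.0:                 # exact match: return its value directly
            return fx
        weights.append((1.0 / d ** 2, fx))
    return sum(w * fx for w, fx in weights) / sum(w for w, _ in weights)

data = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 2.0), ((2.0, 2.0), 5.0)]
print(weighted_knn_predict((0.2, 0.1), data, k=3))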
This results in a fast learning algorithm but it has slow predictions. Efficiency:
• for each prediction, kNN needs to compute the distance for ALL stored examples.
• Prediction time = linear in the size of the data set, for large training sets and/or complex distances this can
be too slow to be practical.
The algorithm (incremental deletion of examples):
Edited k-NN(S), where S is the set of training instances:
  for each instance x in S: if x is correctly classified by S \ {x}, remove x from S
  return S
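A minimal, runnable sketch of this idea (not lecture code): an instance is removed when its neighbours in S \ {x} already classify it correctly, so it is redundant; a plain 1-NN vote is used for simplicity and the toy data are made up:

from math import dist

def nn_label(x, others):
    return min(others, key=lambda e: dist(x, e[0]))[1]

def edited_knn(S):
    S = list(S)
    for example in list(S):
        rest = [e for e in S if e is not example]
        if rest and nn_label(example[0], rest) == example[1]:
            S.remove(example)                 # correctly classified -> redundant
    return S

data = [((0.0, 0.0), "a"), ((0.1, 0.1), "a"), ((5.0, 5.0), "b"), ((5.1, 5.0), "b")]
print(edited_knn(data))                       # roughly one prototype per cluster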
8.4 kD-trees
kD-trees: use a clever data structure to eliminate the need to compute all distances. kD-trees are similar to decision
trees except:
• Splits are made on the median/mean value of dimension with highest variance
• Each node stores one data point, leaves can be empty
Finds closest neighbor in logarithmic (depth of tree) time. However building a good kD-tree may take some time:
Learning time is no longer 0 and incremental learning is no longer trivial:
• kD-tree will no longer be balanced
• re-building the tree is recommended when the max depth becomes larger than 2 * the minimal required depth
(= log(N) with N training examples).
Using Prototypes: the rough decision surfaces of nearest neighbor can sometimes be considered a disadvantage. We
can solve two problems at once by using prototypes (= representative for a whole group of instances) For example
prototypes can be:
• single instances, replacing a group
• other structure, (e.g., rectangle/shape, rule, ..)
• Radial basis function networks: basically build a global approximation as a linear combination of local approximations: f(x) = w0 + Σ_{u=1}^{k} wu Ku(d(xu, x)).
A common choice is Ku(d(xu, x)) = exp(−d²(xu, x) / (2δu²)); with this choice the influence of each local approximation u goes down quickly with distance.
• Locally weighted regression: build a local model in the region around xq (e.g. a linear or quadratic model) by minimizing the squared error over the k nearest neighbors: E1(xq) ≡ Σ_{x ∈ kNN(xq)} (f(x) − f̂(x))²
8.8.1 Candidate Literals
There are two separate methods to determining candidate literals for these algorithms.
8.8.3 Heuristics
When is a rule considered a good rule?
• High accuracy
Much more efficient search (He is much smaller than H, the set of all rules).
Less robust with respect to noise; a noisy example may require a restart.
5. return BSR
9 Lecture 9: Clustering
9.1 Unsupervised Learning
The data just contains x; there is no given classification or other information. The main goal is to find structure in the data. A definition of ground truth is often missing (there is no clear error function like in supervised learning).
9.2 Clustering
Problem definition:
Let X = (x1, x2, ..., xd) be a d-dimensional feature vector.
Let D be a set of such vectors, D = {X1, X2, ..., XN}. Given data D, group the N vectors into K groups such that the grouping is optimal.
Clustering is used for:
• Establish prototypes or detect outliers
9.6 Cluster Assignment
• Hard clustering: Each item is a member of one cluster
• Soft Clustering: Each item has a probability of membership in each cluster
• Disjunctive clustering: An item belongs to only one cluster
• Overlapping clustering: An item can be in more than one cluster
• Exhaustive clustering: Each item is a member of a cluster
• Partial Clustering: Some items do not belong to a cluster (in practice this is equal to exhaustive clustering
with singleton clusters)
9.8.1 Dendrogram
A tree view of hierarchical clusters; the higher the top bar (the horizontal line joining two clusters), the higher the degree of difference within the merged cluster.
10 Lecture 10:
10.1 Reinforcement learning
Reinforcement learning addresses the situation where an agent only receives a reward after a sequence/series of actions has been performed. It is inspired by biological and societal systems in which an agent is given a reward (e.g. dopamine) based on previous decisions, instead of being given constant guidance about which decision is correct or incorrect.
In reinforcement learning, the agent typically does not possess full knowledge of the environment or the result of
each action. More formally:
• Given:
1. a Set of States S (known to the agent only after exploration)
2. a Set of Actions A (per state)
3. a Transition function: st+1 = δ(st, at) (unknown to the agent), where δ represents the state transition
4. a Reward function: rt = r(st , at ) (unknown to agent)
• Find:
1. Policy π : S → A that outputs an appropriate action a from set A, given the current state s from set S
such that π(st ) = at .
The value of a policy π from state st is the discounted cumulative reward V^π(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σ_{i=0}^{∞} γ^i rt+i,
where 0 ≤ γ ≤ 1 is a "discount factor" that leads us to prefer either immediate reward or delayed reward (higher values of γ → more weight on later rewards). Therefore, the optimal policy becomes:
• π* ≡ argmax_π V^π(s), (∀s), where V^{π*} is the value function of the optimal policy for state s, also written:
• V^{π*}(s) or V*(s)
However, this presents a problem: how can we learn the optimal policy π*: S → A for arbitrary environments? Since training data of the form ⟨s, a⟩ is not available, π* cannot be learned directly, because the agent can only directly choose a and not s. This leads us to the concept of Q-Learning.
The problem with this is that the agent typically does not have perfect knowledge of δ (the state transitions) or r (the reward in all states). This means that the agent cannot predict the reward and the immediate successor state, so V* cannot be learned directly. Solution: learn the Q-values instead, by computing the optimal Q-values for all state-action pairs using the Bellman equation:
10.3.2 Learning the Q-Values
We use iterative approximation to learn the Q-values for a given state-action pair:
• V*(s) = max_{a'} Q(s, a')
so that we can rewrite:
• Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
and we then obtain the recursive update rule that allows an iterative approximation of Q:
• Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
This way, the agent stores the value Q̂(s, a) in a large look-up table. Then the agent repeatedly observes its own
current state s, chooses some action a, and observes the resulting reward r(s, a) and the new state s0 = δ(s, a). This
way, the agent repeatedly samples from unknown functions δ(s, a) and r(s, a) without having full knowledge of these
functions.
Actions with higher Q̂(s, a) are more likely to be picked than other actions, e.g. by choosing action ai in state s with probability P(ai | s) = k^{Q̂(s, ai)} / Σ_j k^{Q̂(s, aj)}. A high k gives a higher exploitation factor, a lower k a higher exploration factor.
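A minimal sketch (not lecture code) of tabular Q-learning with the deterministic update rule above; for simplicity it uses ε-greedy action selection instead of the k-based probabilities, and the environment interface (env.step, env.is_terminal) is an assumption of this example:

import random
from collections import defaultdict

def q_learning(env, states, actions, episodes=500, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # look-up table Q[(s, a)]
    for _ in range(episodes):
        s = random.choice(states)
        while not env.is_terminal(s):
            if random.random() < epsilon:        # exploration
                a = random.choice(actions)
            else:                                # exploitation
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s_next = env.step(s, a)
            # deterministic update rule: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q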
10.5 Expectation Maximization
Given a statistical model which generates a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data:
• L(θ; X) = p(X | θ) = ∫ p(X, Z | θ) dZ
However, this quantity is often intractable (e.g. if z is a sequence of events, so that the number of values grows
exponentially with the sequence length, making the exact calculation of the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:
– Expectation step (E-step): compute the expected log-likelihood with respect to the current estimate of the latent-variable distribution, Q(θ | θ(t)) = E_{Z|X,θ(t)} [ log L(θ; X, Z) ]
– Maximization step (M-step): θ(t+1) = argmax_θ Q(θ | θ(t))
The typical models to which EM is applied use Z as a latent variable indicating membership in one of a set of groups:
The observed data points x may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). Associated with each data point may be a vector of observations. The missing values (aka latent variables) Z are discrete, drawn from a fixed number of values, with one latent variable per observed unit. The parameters are continuous and of two kinds: parameters that are associated with all data points, and parameters associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value). However, it is possible to apply EM to other sorts of models.
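A minimal sketch (not lecture code) of EM for a one-dimensional mixture of two Gaussians, where Z is the unobserved component membership of each point; the data and the initial parameter guesses are made up:

import numpy as np

def em_two_gaussians(x, iterations=50):
    mu = np.array([x.min(), x.max()])            # initial means
    sigma = np.array([x.std(), x.std()]) + 1e-6  # initial standard deviations
    pi = np.array([0.5, 0.5])                    # mixing weights
    for _ in range(iterations):
        # E-step: responsibilities r[i, k] = P(z_i = k | x_i, theta)
        dens = np.stack([pi[k] * np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
                         / (np.sqrt(2 * np.pi) * sigma[k]) for k in range(2)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(x)
    return mu, sigma, pi

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_two_gaussians(x))   # the two means should come out near 0 and 5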