Co-2 ML 2019
• Supervised Learning
Nearest Neighbour
Naive Bayes
Logistic Regression
Support Vector Machines
Neural Networks
Decision Trees
• Unsupervised & Semi-Supervised Learning
Clustering (K-means, GMMs)
Factor Analysis (PCA, LDA)
• Learning Theory
Bias and Variance
Probably Approximately Correct (PAC) Learning
Supervised Learning
• The model is “trained” on a pre-defined set of “training examples”, which then facilitate its
ability to reach an accurate conclusion when given new data.
• Supervised algorithms learn from labelled training data. The algorithms are “supervised”
because we know what the correct answer is.
• For example, if the algorithm receives a batch of images labelled as apples or oranges, it can
first guess the object in each image and then use the label to check whether its guess is correct.
• It is called supervised learning because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct
answers, the algorithm iteratively makes predictions on the training data and is corrected by
the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
K-Nearest Neighbour
• The most basic instance-based method is the k-nearest neighbour
algorithm.
• This algorithm assumes all instances correspond to points in the n-dimensional space $\mathbb{R}^n$.
• The nearest neighbours of an instance are defined in terms of the
standard Euclidean distance. More precisely, let an arbitrary instance x
be described by the feature vector
$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$
where $a_r(x)$ denotes the value of the r-th attribute of instance x. Then the
distance between two instances $x_i$ and $x_j$ is defined to be
$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \big(a_r(x_i) - a_r(x_j)\big)^2}$
K-Nearest Neighbour: a set of positive and negative training examples is shown on the left,
along with a query instance x_q to be classified. The 1-Nearest Neighbour algorithm
classifies x_q as positive, whereas 5-Nearest Neighbour classifies it as negative. On the right
is the decision surface induced by the 1-Nearest Neighbour algorithm for a typical set of
training examples. The convex polygon surrounding each training example indicates the
region of instance space closest to that point (i.e., the instances for which the 1-Nearest
Neighbour algorithm will assign the classification belonging to that training example).
• The intuition behind the KNN algorithm is one of the simplest of all the supervised
machine learning algorithms.
• It simply calculates the distance of a new data point to all other training data points.
• It then selects the K nearest data points, where K can be any integer. Finally, it
assigns the new data point to the class to which the majority of those K data points belong.
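A minimal from-scratch sketch of these three steps (the helper name knn_predict and the variable names are ours, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the K nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# e.g. knn_predict(np.array([[1, 1], [2, 2], [9, 9]]), ["A", "A", "B"], np.array([1.5, 1.5]))  -> "A"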
• The Euclidean distance is the square root of the sum of squared differences between the
coordinates of a pair of objects. Mathematically, it can be represented as
$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
• Your task is to classify a new data point with 'X' into "Blue" class or "Red" class. The
coordinate values of the data point are x=45 and y=50. Suppose the value of K is 3.
The KNN algorithm starts by calculating the distance of point X from all the points. It
then finds the 3 nearest points with least distance to point X. This is shown in the
figure below. The three nearest points have been encircled.
• The final step of the KNN algorithm is to assign the new point to the class to which the
majority of the three nearest points belong. From the figure we can see that two of the three
nearest points belong to the class "Red" while one belongs to the class "Blue". Therefore the
new data point is classified as "Red".
Pros
•It is extremely easy to implement.
•It is a lazy learning algorithm and therefore requires no training prior to making real-time
predictions. This makes the KNN algorithm much faster than algorithms that do require
training, e.g. SVM, linear regression, etc.
•Since the algorithm requires no training before making predictions, new data can be added
seamlessly.
•There are only two parameters required to implement KNN, i.e. the value of K and the
distance function (e.g. Euclidean or Manhattan).
Cons
•The KNN algorithm doesn't work well with high-dimensional data because, with a large
number of dimensions, it becomes difficult for the algorithm to calculate distance in each
dimension.
•The KNN algorithm has a high prediction cost for large datasets, because the cost of
calculating the distance between a new point and every existing point grows with the size of
the dataset.
•Finally, the KNN algorithm doesn't work well with categorical features, since it is difficult
to define a distance over dimensions with categorical features.
Applications of KNN Algorithm
Naive Bayes
Bayes' theorem provides a way of calculating the posterior probability P(Y|X) from P(Y), P(X) and P(X|Y):
$P(Y \mid X) = \dfrac{P(X \mid Y)\,P(Y)}{P(X)}$
• For example, a fruit may be considered to
be an apple if it is red, round, and about 3
inches in diameter. Even if these features
depend on each other or upon the existence
of the other features, all of these properties
independently contribute to the probability
that this fruit is an apple and that is why it is
known as ‘Naive’.
Let's understand it using an example. Consider a training data set of weather and the
corresponding target variable 'Play' (indicating whether play takes place). We need to
classify whether players will play or not based on the weather conditions. Let's follow the
steps below.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, e.g. P(Overcast) = 0.29 and the
probability of playing = 0.64.
Step 3: Now use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the posterior-probability method discussed above.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 =
0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is "Yes".
Naive Bayes uses a similar method to predict the probability of each class based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
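The arithmetic above can be reproduced directly (counts taken from the worked example):

# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9     # 0.33
p_yes = 9 / 14                # 0.64
p_sunny = 5 / 14              # 0.36
print(round(p_sunny_given_yes * p_yes / p_sunny, 2))   # 0.6 -> "Yes" is the more probable class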
Pros
•It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
•When the assumption of independence holds, a Naive Bayes classifier performs better compared to
other models such as logistic regression, and it needs less training data.
•It performs well with categorical input variables compared to numerical variable(s). For
numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Cons
•If a categorical variable has a category in the test data set that was not observed in the training data
set, the model will assign it a 0 (zero) probability and will be unable to make a prediction. This
is often known as the "Zero Frequency" problem. To solve this, we can use a smoothing technique; one
of the simplest smoothing techniques is Laplace estimation (see the sketch after this list).
•On the other side, naive Bayes is also known to be a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
•Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it
is almost impossible to get a set of predictors that are completely independent.
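A minimal sketch of Laplace (add-one) smoothing, as mentioned in the list above; the helper name and the example counts are ours, not from the slides:

def smoothed_likelihood(count_value_and_class, count_class, n_feature_values):
    # P(feature value | class) with add-one smoothing: never exactly zero,
    # even for values that were unseen in the training data
    return (count_value_and_class + 1) / (count_class + n_feature_values)

print(smoothed_likelihood(0, 9, 3))   # an unseen value still gets a small non-zero probability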
Applications of Naive Bayes Algorithms
• Real time Prediction: Naive Bayes is an eager learning classifier and it is fast.
Thus, it can be used for making predictions in real time.
• Multi class Prediction: This algorithm is also well known for multi class
prediction feature. Here we can predict the probability of multiple classes of
target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes
classifiers are mostly used in text classification (due to better results in multi-class
problems and the independence assumption) and have a higher success rate compared to
other algorithms.
• It is widely used in Spam filtering (identify spam e-mail) and Sentiment
Analysis (in social media analysis, to identify positive and negative customer
sentiments)
• Recommendation System: a Naive Bayes classifier and collaborative filtering
together build a recommendation system that uses machine learning and data
mining techniques to filter unseen information and predict whether a user would
like a given resource or not.
Logistic Regression
“Regression analysis is a predictive modelling technique. It estimates the relationship
between a dependent (target) and an independent variable (predictor).”
Comparison of linear & logistic regression
Types of logistic regression
•Binary (Pass/Fail)
•Multinomial (Cats, Dogs, Sheep)
•Ordinal (Low, Medium, High)
Sigmoid Activation
The sigmoid (logistic) function maps any real-valued input to a value between 0 and 1:
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Decision Boundary
Our current prediction function returns a probability score between 0 and 1. In order
to map this to a discrete class (true/false, cat/dog), we select a threshold value or
tipping point above which we will classify values into class 1 and below which we
classify values into class 2.
p ≥ 0.5 → class = 1
p < 0.5 → class = 0
For example, if our threshold was .5 and our prediction function returned .7, we
would classify this observation as positive. If our prediction was .2 we would classify
the observation as negative. For logistic regression with multiple classes we could
select the class with the highest predicted probability.
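A minimal sketch of this thresholding step, assuming the standard sigmoid defined earlier (the function names are ours):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # maps any real score into (0, 1)

def predict_class(z, threshold=0.5):
    return 1 if sigmoid(z) >= threshold else 0

print(predict_class(0.85))   # sigmoid(0.85) ≈ 0.7 -> class 1
print(predict_class(-1.4))   # sigmoid(-1.4) ≈ 0.2 -> class 0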
Binary Logistic Regression
Say we’re given data on student exam results and our goal is to predict
whether a student will pass or fail based on number of hours slept and
hours spent studying. We have two features (hours slept, hours studied)
and two classes: passed (1) and failed (0).
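A minimal sketch of such a binary logistic-regression model using scikit-learn; the student data below is made up for illustration, not taken from the slides:

import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up students: [hours slept, hours studied] -> passed (1) / failed (0)
X = np.array([[8, 1], [7, 2], [6, 4], [5, 6], [4, 5], [8, 6]])
y = np.array([0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict([[6, 3]]))              # predicted class for a new student
print(model.predict_proba([[6, 3]])[0, 1])  # P(pass) for the same student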
Support Vector Machines
• A support vector machine allows you to classify data that’s linearly
separable.
• If it isn’t linearly separable, use the kernel trick to make it work.
• However, for text classification it’s better to just stick to a linear
kernel.
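A minimal sketch of a linear-kernel SVM with scikit-learn; the toy 2-D data is illustrative only:

import numpy as np
from sklearn.svm import SVC

# toy 2-D points, two classes, linearly separable
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)   # switch to e.g. kernel="rbf" when the data is not linearly separable
print(clf.predict([[4, 4]]))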
Advantages
Neural Networks
Perceptron
• A perceptron has just two layers of nodes (input nodes and output nodes).
It is often called a single-layer network on account of having one layer of
links between input and output.
• The single perceptron approach to deep learning has one major
drawback: it can only learn linearly separable functions.
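A minimal sketch of the perceptron learning rule on the (linearly separable) AND function; the same loop would never converge on XOR, which is the limitation noted above. The variable names and learning rate are ours:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):                        # a few passes over the data
    for xi, target in zip(X, y_and):
        pred = int(np.dot(w, xi) + b > 0)  # step activation
        w += lr * (target - pred) * xi     # perceptron update rule
        b += lr * (target - pred)

print([int(np.dot(w, xi) + b > 0) for xi in X])   # converges to [0, 0, 0, 1]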
Neural Networks
Activation Functions
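The activation-function plots from the slides are not reproduced here; a minimal sketch of three commonly used activation functions (our selection, not necessarily the ones shown) is:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes input to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes input to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zero for negative input, identity for positive

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")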
The Problem with Large Networks
A neural network can have more than one hidden layer: in that case, the higher layers are “building” new
abstractions on top of the previous layers. And as mentioned before, you can often learn better in practice
with larger networks.
However, increasing the number of hidden layers leads to two known issues: vanishing gradients, where
the error signal becomes too weak to train the early layers effectively, and overfitting, where the larger
model fits noise in the training data rather than the underlying pattern.
Decision Trees
• Decision tree learning uses a decision tree (as a predictive model) to go
from observations about an item (represented in the branches) to
conclusions about the item's target value (represented in the leaves).
• Tree models where the target variable can take a discrete set of values
are called classification trees; in these tree structures, leaves represent
class labels and branches represent conjunctions of features that lead to
those class labels.
• Decision trees where the target variable can take continuous values
(typically real numbers) are called regression trees.
Decision trees used in machine learning are of two main types:
• Classification trees, where the predicted outcome is the class to which the data belongs.
• Regression trees, where the predicted outcome is a real number.
A tree showing survival of
passengers on the Titanic ("sibsp"
is the number of spouses or
siblings aboard). The figures
under the leaves show the
probability of survival and the
percentage of observations in the
leaf. Summarizing: Your chances
of survival were good if you were
(i) a female or (ii) a male younger
than 9.5 years with less than 2.5
siblings.
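A minimal sketch of fitting and printing such a classification tree with scikit-learn; the rows below are hypothetical, not the actual Titanic data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical rows in the spirit of the example: [age, sex (0 = female, 1 = male)] -> survived?
X = np.array([[25, 0], [30, 1], [8, 1], [40, 1], [5, 1], [60, 0]])
y = np.array([1, 0, 1, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "sex"]))   # the learned rules, printed as text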
Decision trees have various advantages
•Simple to understand and interpret.
•Able to handle both numerical and categorical data.
•Other techniques are usually specialized in analyzing datasets that have only one type of variable. (For example,
relation rules can be used only with nominal variables while neural networks can be used only with numerical
variables or categoricals converted to 0-1 values.)
•Requires little data preparation. Other techniques often require data normalization. Since trees can handle
qualitative predictors, there is no need to create dummy variables.
•Uses a white box model. If a given situation is observable in a model the explanation for the condition is easily
explained by boolean logic. Possible to validate a model using statistical tests.
•Non-statistical approach that makes no assumptions of the training data or prediction residuals; e.g., no
distributional, independence, or constant variance assumptions
•Performs well with large datasets.
•Mirrors human decision making more closely than other approaches.
•Robust against collinearity, particularly with boosting.
•Built-in feature selection: additional irrelevant features will be used less, so that they can be removed on
subsequent runs.
•Decision trees can approximate any Boolean function, e.g. XOR.
Limitations
• Trees can be very non-robust. A small change in the training data can
result in a large change in the tree and consequently the final
predictions.
• The problem of learning an optimal decision tree is known to be NP-
complete under several aspects of optimality and even for simple
concepts.
• Decision-tree learners can create over-complex trees that do not
generalize well from the training data (overfitting). Mechanisms such
as pruning are necessary to avoid this problem.
• For data including categorical variables with different numbers of
levels, information gain in decision trees is biased in favor of
attributes with more levels.
Learning Theory: Bias and Variance
• In statistics and machine learning, the bias–variance tradeoff is the property of a set of
predictive models whereby models with a lower bias in parameter estimation have a
higher variance of the parameter estimates across samples, and vice versa.
• The bias is an error from erroneous assumptions in the learning algorithm.
High bias can cause an algorithm to miss the relevant relations between
features and target outputs (underfitting).
• The variance is an error from sensitivity to small fluctuations in the training set.
High variance can cause an algorithm to model the random noise in the training
data rather than the intended outputs (overfitting).
Bias
Variance
Approaches to reduce the bias–variance tradeoff problem
• Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly,
a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at
the expense of introducing additional variance. Learning algorithms typically have some tunable
parameters that control bias and variance; for example,
• linear and Generalized linear models can be regularized to decrease their variance at the cost of
increasing their bias.
• In artificial neural networks, the variance increases and the bias decreases as the number of hidden
units increases.
• In k-nearest neighbour models, a high value of k leads to high bias and low variance (see the sketch
after this list).
• In instance-based learning, regularization can be achieved by varying the mixture of prototypes and
exemplars.
• In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to
control variance.
• One way of resolving the trade-off is to use mixture models and ensemble learning.
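A minimal sketch of the k-NN point above: sweeping k on toy regression data shows the train/test gap shrinking as k grows (the data and the values of k are ours):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 50):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(model.score(X_tr, y_tr), 2), round(model.score(X_te, y_te), 2))
# small k: low bias, high variance (train score well above test score)
# large k: higher bias, lower variance (smoother fit, smaller train/test gap)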
Probably Approximately Correct (PAC) Learning
Computational learning theory
Underlying principle: with high probability, an (efficient) learning algorithm will find a hypothesis
that is approximately identical to the hidden target concept.
(Carla P. Gomes, CS4700)
Notations:
– X: set of all possible examples
– D: distribution from which examples are drawn
– H: set of all possible hypotheses
– N: the number of examples in the training set
– f: the true function to be learned
Goal: with probability at least 1 − δ, the learned hypothesis h satisfies error(h) ≤ ε.
I.e., the chance of a “bad” hypothesis (high error but consistent with the examples) is
small (i.e., less than δ).
Approximately Correct
So for N examples, P(h_b agrees with all N examples) ≤ (1 − ε)^N.
Approximately Correct Hypothesis
A hypothesis h is approximately correct if error(h) ≤ ε; a “bad” hypothesis h_b is one with error(h_b) > ε.
P(H_bad contains a consistent hypothesis) ≤ |H_bad| (1 − ε)^N ≤ |H| (1 − ε)^N
Goal –
Bound the probability of learning a bad hypothesis below some
small number δ.
Note:
(The more accuracy (with smaller ε), and the more certainty desired (with smaller δ), the more
examples one needs.)
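The formula on this slide is not reproduced above; the standard PAC sample-complexity bound, derived from the inequality for H_bad together with $1-\varepsilon \le e^{-\varepsilon}$, is:
$|H|(1-\varepsilon)^N \le |H|\,e^{-\varepsilon N} \le \delta \;\Longrightarrow\; N \ge \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)$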
– An efficient learning algorithm
Theoretical results apply to fairly simple learning models (e.g., decision list learning)
PAC Learning
Two steps:
Example: Boolean Functions
Consider H, the set of all Boolean functions on n attributes: $|H| = 2^{2^n}$.
Then the sample complexity is
$N \ge \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right) = O(2^n)$
So the sample complexity grows as $2^n$ (the same as the number of all possible examples).
Not PAC-learnable!
So, any learning algorithm will do no better than a lookup table
if it merely returns a hypothesis that is consistent with all known
examples!