Co-2 ML 2019

CO-II

• Supervised Learning
 Nearest Neighbour
 Naive Bayes
 Logistic Regression
 Support Vector Machines
 Neural Networks
 Decision Trees
• Unsupervised & Semi-Supervised Learning
 Clustering (K-means, GMMs)
 Factor Analysis (PCA, LDA)
• Learning Theory
 Bias and Variance
 Probably Approximately Correct (PAC) Learning

1
Supervised Learning

• The model is “trained” on a pre-defined set of “training examples”, which then facilitate its
ability to reach an accurate conclusion when given new data.

• Supervised algorithms learn from labelled training data. The algorithms are “supervised”
because we know what the correct answer is.

• For example, if the algorithm receives a set of images labelled as apples or oranges, it can
first guess the object in the image, then use the label to check whether its guess is correct.

• It is called supervised learning because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct
answers, the algorithm iteratively makes predictions on the training data and is corrected by
the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

2
K-Nearest Neighbour
• The most basic instance-based method is the k-nearest neighbour
algorithm.
• This algorithm assumes all instances correspond to points in the n-
dimensional space ℜn.
• The nearest neighbours of an instance are defined in terms of the
standard Euclidean distance. More precisely, let an arbitrary instance x
be described by the feature vector

⟨ a1(x), a2(x), …, an(x) ⟩

where ar(x) denotes the value of the rth attribute of instance x. Then the
distance between two instances xi and xj is defined to be d(xi, xj), where

d(xi, xj) = √( Σ r=1..n ( ar(xi) − ar(xj) )² )

3
K-Nearest Neighbour: a set of positive and negative training examples is shown on the left,
along with a query instance xq to be classified. The 1-Nearest Neighbour algorithm
classifies xq as positive, whereas 5-Nearest Neighbour classifies it as negative. On the right
is the decision surface induced by the 1-Nearest Neighbour algorithm for a typical set of
training examples: the convex polygon surrounding each training example indicates the
region of instance space closest to that point (i.e., the instances for which the 1-Nearest
Neighbour algorithm will assign the classification belonging to that training example).

4
• The intuition behind the KNN algorithm is one of the simplest of all the supervised
machine learning algorithms.

• It simply calculates the distance of a new data point to all other training data points.

• The distance can be measured with any metric, e.g. Euclidean or Manhattan distance.

• It then selects the K-nearest data points, where K can be any integer. Finally it
assigns the data point to the class to which the majority of the K data points belong.

5
• Euclidean distance computes the root of the summed squared differences between the
coordinates of a pair of objects. Mathematically, it can be represented as

d(x, y) = √( Σ i ( xi − yi )² )

• Manhattan distance computes the sum of the absolute differences between the
coordinates of a pair of objects. Mathematically, it can be represented as

d(x, y) = Σ i | xi − yi |
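
For concreteness, a minimal sketch of both distance metrics (plain Python, not from the original slides):

```python
import math

def euclidean_distance(p, q):
    # Root of the summed squared differences between coordinates.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p, q):
    # Sum of the absolute differences between coordinates.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(manhattan_distance((1, 2), (4, 6)))  # 7
```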

6
• Your task is to classify a new data point 'X' into the "Blue" class or the "Red" class. The
coordinates of the data point are x = 45 and y = 50. Suppose the value of K is 3.
The KNN algorithm starts by calculating the distance of point X from all the points. It
then finds the 3 points with the least distance to point X (shown encircled in the figure).
• The final step of the KNN algorithm is to assign the new point to the class to which the
majority of the three nearest points belong. In the figure, two of the three nearest points
belong to the class "Red" while one belongs to the class "Blue". Therefore the new data
point is classified as "Red".

7
Pros
• It is extremely easy to implement.
• It is a lazy learning algorithm and therefore requires no training prior to making real-time
predictions. This makes the KNN algorithm much faster than algorithms that require a
training phase, e.g. SVM, linear regression, etc.
• Since the algorithm requires no training before making predictions, new data can be added
seamlessly.
• There are only two parameters required to implement KNN: the value of K and the
distance function (e.g. Euclidean or Manhattan).

Cons
• The KNN algorithm doesn't work well with high-dimensional data, because with a large
number of dimensions it becomes difficult for the algorithm to compute meaningful
distances.
• The KNN algorithm has a high prediction cost for large datasets, because the distance
between the new point and every existing point must be calculated at prediction time.
• Finally, the KNN algorithm doesn't work well with categorical features, since it is difficult
to define a distance between dimensions with categorical values.
8
Applications of KNN Algorithm

• KNN is a simple yet powerful classification algorithm. It requires no
training for making predictions, which is typically one of the most
difficult parts of a machine learning algorithm.
• The KNN algorithm has been widely used to find document
similarity and in pattern recognition.
• It has also been employed for developing recommender systems and
for dimensionality reduction and pre-processing steps for computer
vision, particularly face recognition tasks.

9
Naive Bayes

• It is a classification technique based on Bayes' theorem with an
assumption of independence among predictors.
• In simple terms, a Naive Bayes classifier assumes that the presence of
a particular feature in a class is unrelated to the presence of any other
feature.
• A Naive Bayes model is easy to build and particularly useful for very
large data sets. Along with its simplicity, Naive Bayes can
outperform even highly sophisticated classification methods.

10
Bayes' theorem provides a way of calculating the posterior probability P(Y|X) from P(Y), P(X) and P(X|Y):

P(Y|X) = P(X|Y) · P(Y) / P(X)

13
• For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter. Even if these features depend on each other or upon the existence of
the other features, all of these properties independently contribute to the probability that
this fruit is an apple, and that is why it is known as 'Naive'.

14
Let's understand it using an example. Below we have a training data set of weather and the
corresponding target variable 'Play' (suggesting the possibility of playing). Now, we need to
classify whether players will play or not based on the weather condition. Let's follow the
steps below to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, e.g. Overcast probability =
0.29 and probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.

15
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability,
so the prediction is "Yes".

Naive Bayes uses a similar method to predict the probability of each class based on
various attributes. This algorithm is mostly used in text classification and in problems
having multiple classes.
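
A quick check of this calculation in Python; the counts are inferred from the probabilities quoted above (14 days in total, 5 of them Sunny, 9 with Play = Yes, 3 Sunny days with Play = Yes):

```python
# Counts inferred from the probabilities quoted above.
total = 14
sunny = 5          # P(Sunny) = 5/14
yes = 9            # P(Yes) = 9/14
sunny_and_yes = 3  # P(Sunny | Yes) = 3/9

p_sunny = sunny / total
p_yes = yes / total
p_sunny_given_yes = sunny_and_yes / yes

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6
```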

16
Pros
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to
other models like logistic regression, and you need less training data.
• It performs well with categorical input variables compared to numerical variable(s). For
numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Cons
• If a categorical variable has a category (in the test data set) which was not observed in the training data
set, then the model will assign it a zero probability and will be unable to make a prediction. This
is often known as "Zero Frequency". To solve this, we can use a smoothing technique; one
of the simplest smoothing techniques is Laplace estimation.
• On the other side, Naive Bayes is also known to be a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it
is almost impossible to get a set of predictors which are completely independent.

17
Applications of Naive Bayes Algorithms

• Real-time Prediction: Naive Bayes is an eager learning classifier and it is very
fast. Thus, it can be used for making predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class
prediction capability. Here we can predict the probability of multiple classes of
the target variable.
• Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes
classifiers are mostly used in text classification (due to better results in multi-class
problems and the independence assumption) and have a higher success rate
compared to other algorithms.
• They are widely used in spam filtering (identifying spam e-mail) and sentiment
analysis (in social media analysis, to identify positive and negative customer
sentiment).
• Recommendation Systems: A Naive Bayes classifier and collaborative filtering
together build a recommendation system that uses machine learning and data
mining techniques to filter unseen information and predict whether a user would
like a given resource or not.
18
Logistic Regression
"Regression analysis is a predictive modelling technique. It estimates the relationship
between a dependent (target) variable and an independent variable (predictor)."

• Logistic regression is a classification algorithm used to assign observations to a discrete set of classes.

• Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its
output using the logistic sigmoid function to return a probability value, which can then be mapped to
two or more discrete classes.

19
Comparison of linear & logistic regression

Example: given data on time spent studying and exam scores, linear
regression and logistic regression can predict different things:
• Linear regression could help us predict the student's test score on a scale of
0 - 100. Linear regression predictions are continuous (numbers in a range).
• Logistic regression could help us predict whether the student passed or
failed. Logistic regression predictions are discrete (only specific values or
categories are allowed). We can also view the probability scores underlying the
model's classifications.

20
Types of logistic regression

•Binary (Pass/Fail)
•Multi (Cats, Dogs, Sheep)
•Ordinal (Low, Medium, High)

21
Comparison of linear & logistic regression

22
Sigmoid Activation

In order to map predicted values to probabilities, we use the sigmoid function.
The function maps any real value into another value between 0 and 1.
In machine learning, we use the sigmoid to map predictions to probabilities.

S(z) = 1 / (1 + e^(-z))

S(z) = output between 0 and 1 (probability estimate)
z = input to the function (your algorithm's prediction, e.g. mx + b)
e = base of the natural logarithm
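
A minimal sketch of the sigmoid in Python, where z stands for the linear prediction (e.g. m*x + b):

```python
import math

def sigmoid(z):
    # Squashes any real-valued input into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.982
print(sigmoid(-4))  # ~0.018
```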

23
Decision Boundary

Our current prediction function returns a probability score between 0 and 1. In order
to map this to a discrete class (true/false, cat/dog), we select a threshold value or
tipping point above which we will classify values into class 1 and below which we
classify values into class 2.

p ≥ 0.5 → class = 1
p < 0.5 → class = 0

For example, if our threshold was .5 and our prediction function returned .7, we
would classify this observation as positive. If our prediction was .2 we would classify
the observation as negative. For logistic regression with multiple classes we could
select the class with the highest predicted probability.

24
Binary Logistic Regression

Say we’re given data on student exam results and our goal is to predict
whether a student will pass or fail based on number of hours slept and
hours spent studying. We have two features (hours slept, hours studied)
and two classes: passed (1) and failed (0).
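
A hedged sketch of this setup using scikit-learn (assumed available); the hours-slept/hours-studied values below are made up for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [hours slept, hours studied] -> passed (1) / failed (0).
X = [[8, 1], [6, 2], [5, 5], [7, 4], [4, 1], [9, 6], [5, 0], [8, 5]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

model = LogisticRegression()
model.fit(X, y)

# Probability of passing for a student who slept 6 hours and studied 3 hours,
# followed by the 0.5-threshold class decision from the previous slide.
prob = model.predict_proba([[6, 3]])[0][1]
print(prob, 1 if prob >= 0.5 else 0)
```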

25
Support Vector Machines
• A support vector machine allows you to classify data that’s linearly
separable.
• If it isn’t linearly separable, use the kernel trick to make it work.
• However, for text classification it’s better to just stick to a linear
kernel.
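
A hedged sketch of a linear-kernel text classifier with scikit-learn (assumed available); the tiny corpus and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: label 1 = sports, 0 = politics.
texts = [
    "the team won the football match",
    "parliament passed the new budget",
    "a thrilling cricket game last night",
    "the election results were announced",
]
labels = [1, 0, 1, 0]

# Linear kernel (LinearSVC), as suggested above for text classification.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["who won the match"]))
```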

26
Advantages

• Compared to newer algorithms like neural networks, SVMs have two
main advantages: higher speed and better performance with a limited
number of samples (in the thousands).

• This makes the algorithm very suitable for text classification
problems, where it is common to have access to a dataset of at most a
couple of thousand tagged samples.

28
Neural Networks

29
Neural Networks

• In 1943, Warren S. McCulloch, a neuroscientist, and Walter Pitts, a
logician, developed the first conceptual model of an artificial neural
network. In their paper, "A Logical Calculus of the Ideas Immanent in
Nervous Activity," they describe the concept of a neuron: a single cell
living in a network of cells that receives inputs, processes those inputs,
and generates an output.

30
Perceptron

• A perceptron has just 2 layers of nodes (input nodes and output nodes).
It is often called a single-layer network on account of having 1 layer of
links between input and output.

• The training of the perceptron consists of feeding it multiple training
samples and calculating the output for each of them. After each
sample, the weights w are adjusted so as to minimize the
output error, defined as the difference between the desired (target) and
the actual outputs. There are other error functions, like the mean
squared error, but the basic principle of training remains the same.
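
A minimal sketch of this training rule on a hypothetical linearly separable task (logical AND); the learning rate and epoch count are illustrative choices:

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    # samples: list of (input_vector, target) pairs with targets in {0, 1}.
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation >= 0 else 0
            error = target - output  # desired minus actual output
            # Adjust weights in the direction that reduces the output error.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Hypothetical linearly separable task: logical AND of two inputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print(w, b)
```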

31
• The single-perceptron approach to deep learning has one major
drawback: it can only learn linearly separable functions.

• To address this problem, we need a multilayer perceptron,
also known as a feedforward neural network: in effect, we compose a
number of these perceptrons together to create a more powerful
mechanism for learning.

32
Neural Networks

33
Activation Functions

40
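
The activation-function plots on the following slides are not reproduced here; as a stand-in, a hedged sketch of three commonly used activation functions:

```python
import math

def sigmoid(z):
    # Outputs in (0, 1); historically common but prone to vanishing gradients.
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Zero-centred variant of the sigmoid; outputs in (-1, 1).
    return math.tanh(z)

def relu(z):
    # Rectified Linear Unit: cheap to compute and widely used in deep networks.
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```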
The Problem with Large Networks
A neural network can have more than one hidden layer: in that case, the higher layers are "building" new
abstractions on top of the previous layers. And, as we mentioned before, you can often learn better in practice
with larger networks.
However, increasing the number of hidden layers leads to two known issues:

• Vanishing gradients: as we add more and more hidden layers, backpropagation
becomes less and less useful in passing information to the lower layers. In effect, as
information is passed back, the gradients begin to vanish and become small relative to the
weights of the network.

• Overfitting: perhaps the central problem in machine learning. Briefly, overfitting
describes the phenomenon of fitting the training data too closely, maybe with hypotheses
that are too complex. In such a case, your learner ends up fitting the training data really well,
but will perform much more poorly on real examples.

48
Decision Trees
• Decision tree learning uses a decision tree (as a predictive model) to go
from observations about an item (represented in the branches) to
conclusions about the item's target value (represented in the leaves).

• It is one of the predictive modelling approaches used in statistics, data
mining and machine learning.

• Tree models where the target variable can take a discrete set of values
are called classification trees; in these tree structures, leaves represent
class labels and branches represent conjunctions of features that lead to
those class labels.

• Decision trees where the target variable can take continuous values
(typically real numbers) are called regression trees.

49
Decision trees used in machine learning are of two main types:

• Classification tree analysis is when the predicted outcome is the (discrete)
class to which the data belongs.
• Regression tree analysis is when the predicted outcome can be considered a
real number (e.g. the price of a house, or a patient's length of stay in a
hospital).
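
A hedged sketch of both tree types with scikit-learn (assumed available); the data sets are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: discrete target (hypothetical pass/fail data).
X_cls = [[8, 1], [6, 2], [5, 5], [7, 4], [4, 1], [9, 6]]
y_cls = [0, 0, 1, 1, 0, 1]
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print(clf.predict([[6, 3]]))

# Regression tree: continuous target (hypothetical house sizes vs. prices).
X_reg = [[50], [80], [120], [160], [200]]
y_reg = [100.0, 150.0, 210.0, 280.0, 340.0]
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[140]]))
```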

50
A tree showing survival of
passengers on the Titanic ("sibsp"
is the number of spouses or
siblings aboard). The figures
under the leaves show the
probability of survival and the
percentage of observations in the
leaf. Summarizing: Your chances
of survival were good if you were
(i) a female or (ii) a male younger
than 9.5 years with less than 2.5
siblings.

51
Decision trees have various advantages:
• Simple to understand and interpret.
• Able to handle both numerical and categorical data. Other techniques are usually specialized in analysing
datasets that have only one type of variable. (For example, relation rules can be used only with nominal
variables, while neural networks can be used only with numerical variables or categoricals converted to 0-1 values.)
• Require little data preparation. Other techniques often require data normalization. Since trees can handle
qualitative predictors, there is no need to create dummy variables.
• Use a white-box model. If a given situation is observable in a model, the explanation for the condition is easily
expressed in Boolean logic. It is possible to validate a model using statistical tests.
• A non-statistical approach that makes no assumptions about the training data or prediction residuals; e.g., no
distributional, independence, or constant-variance assumptions.
• Perform well with large datasets.
• Mirror human decision making more closely than other approaches.
• Robust against collinearity, particularly with boosting.
• Built-in feature selection: additional irrelevant features will be used less, so they can be removed on subsequent
runs.
• Decision trees can approximate any Boolean function, e.g. XOR.

52
Limitations
• Trees can be very non-robust. A small change in the training data can
result in a large change in the tree and consequently the final
predictions.
• The problem of learning an optimal decision tree is known to be NP-
complete under several aspects of optimality and even for simple
concepts.
• Decision-tree learners can create over-complex trees that do not
generalize well from the training data (overfitting). Mechanisms such
as pruning are necessary to avoid this problem.
• For data including categorical variables with different numbers of
levels, information gain in decision trees is biased in favor of
attributes with more levels.

53
Learning Theory: Bias and Variance
• In statistics and machine learning, the bias–variance tradeoff is the property of a set of
predictive models whereby models with a lower bias in parameter estimation have a
higher variance of the parameter estimates across samples, and vice versa.

• The bias–variance dilemma or problem is the conflict in trying to simultaneously
minimize these two sources of error that prevent supervised learning algorithms from
generalizing beyond their training set.

54
• The bias is an error from erroneous assumptions in the learning algorithm.
High bias can cause an algorithm to miss the relevant relations between
features and target outputs (underfitting).

• The variance is an error from sensitivity to small fluctuations in the training
set. High variance can cause an algorithm to model the random noise in the
training data, rather than the intended outputs (overfitting).

55
Bias

• Models with low bias are usually more complex (e.g.
higher-order regression polynomials), enabling them to
represent the training set more accurately. In the process,
however, they may also represent a large noise component
in the training set, making their predictions less accurate
despite their added complexity. In contrast, models with
higher bias tend to be relatively simple (low-order or even
linear regression polynomials) but may produce lower-variance
predictions when applied beyond the training set.

56
Variance

• High-variance learning methods may be able to represent
their training set well but are at risk of overfitting to noisy
or unrepresentative training data. In contrast, algorithms
with low variance typically produce simpler models that
don't tend to overfit but may underfit their training data,
failing to capture important regularities.

57
Approaches to reduce the bias–variance trade-off problem

• Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly,
a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at
the expense of introducing additional variance. Learning algorithms typically have some tunable
parameters that control bias and variance; for example:
• Linear and generalized linear models can be regularized to decrease their variance at the cost of
increasing their bias.
• In artificial neural networks, the variance increases and the bias decreases as the number of hidden
units increases.
• In k-nearest neighbour models, a high value of k leads to high bias and low variance (see the sketch
after this list).
• In instance-based learning, regularization can be achieved by varying the mixture of prototypes and
exemplars.
• In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to
control variance.
• One way of resolving the trade-off is to use mixture models and ensemble learning.
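
As a hedged illustration of the k-nearest-neighbour point above (small k: low bias, high variance; large k: high bias, low variance), using scikit-learn and made-up noisy data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Made-up noisy two-class data.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

for k in (1, 5, 25, 75):
    model = KNeighborsClassifier(n_neighbors=k)
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    # Small k: near-perfect training accuracy but more sensitivity to the data (variance).
    # Large k: smoother, higher-bias decision boundary.
    print(k, round(train_acc, 2), round(cv_acc, 2))
```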

58
Probably Approximately Correct (PAC)
Learning

59
Computational learning theory

Intersection of AI, statistics, and computational theory.

Introduces Probably Approximately Correct (PAC) learning, which concerns efficient
learning.

For our learning procedures we would like to prove that:
With high probability an (efficient) learning algorithm will find a hypothesis that
is approximately identical to the hidden target concept.

Note the double "hedging": probably and approximately.

Why do we need both levels of uncertainty (in general)?
Probably Approximately Correct Learning

Underlying principle:

Seriously wrong hypotheses can be found out almost certainly
(with high probability) using a "small" number of examples.

– Any hypothesis that is consistent with a sufficiently large
set of training examples is unlikely to be seriously wrong: it
must be probably approximately correct.

– Any (efficient) algorithm that returns hypotheses that are
PAC is called a PAC-learning algorithm.
Probably Approximately Correct Learning

How many examples are needed to guarantee correctness?

– Sample complexity (the number of examples needed to "guarantee" correctness)
grows with the size of the hypothesis space.

– Stationarity assumption: the training set and test sets are drawn
from the same distribution.
Notations
– X: set of all possible examples
– D: distribution from which examples are drawn
– H: set of all possible hypotheses
– N: the number of examples in the training set
– f: the true function to be learned

Assume: the true function f is in H.

Error of a hypothesis h with respect to f: the probability that h differs from f on a
randomly picked example:

error(h) = P(h(x) ≠ f(x) | x drawn from D)

This is exactly what we are trying to measure with our test set.
Approximately Correct

A hypothesis h is approximately correct if:

error(h) ≤ ε,

where ε is a given threshold (a small constant).

Goal:
Show that after seeing a small (polynomial) number of examples N, with
high probability, all consistent hypotheses will be approximately correct.

I.e., the chance of a "bad" hypothesis (high error but consistent with the examples) is
small (i.e., less than δ).
Approximately Correct

Approximately correct hypotheses lie inside the ε-ball around f;
those hypotheses that are seriously wrong (h_bad ∈ H_bad) are outside the ε-ball:

error(h_bad) = P(h_bad(x) ≠ f(x) | x drawn from D) > ε

Thus the probability that h_bad (a seriously wrong hypothesis) disagrees with one
example is at least ε (definition of error).
Thus the probability that h_bad (a seriously wrong hypothesis) agrees with one
example is no more than (1 − ε).

So for N examples, P(h_bad agrees with N examples) ≤ (1 − ε)^N.
Approximately Correct Hypothesis

The probability that H_bad contains at least one consistent hypothesis is
bounded by the sum of the individual probabilities:

P(H_bad contains a consistent hypothesis, agreeing with all the examples)
≤ |H_bad| (1 − ε)^N ≤ |H| (1 − ε)^N

(using the fact that h_bad agrees with one example with probability no more than (1 − ε)).
P(H_bad contains a consistent hypothesis) ≤ |H_bad| (1 − ε)^N ≤ |H| (1 − ε)^N

Goal: bound the probability of learning a bad hypothesis below some small number δ:

|H| (1 − ε)^N ≤ δ

Since (1 − ε) ≤ e^(−ε), this is satisfied if

N ≥ (1/ε) (ln(1/δ) + ln |H|)

This N is the sample complexity: the number of examples needed to guarantee a PAC-learnable
function class. If the learning algorithm returns a hypothesis that is consistent with this many
examples, then with probability at least (1 − δ) the learning algorithm has an error of at most ε,
and the hypothesis is Probably Approximately Correct. The more accuracy (smaller ε) and the
more certainty (smaller δ) one wants, the more examples one needs.

Probably Approximately Correct hypothesis h:
– The probability of a small error (error(h) ≤ ε) is greater than or equal to a given threshold 1 − δ.
– A bound on the number of examples (sample complexity) needed to guarantee PAC is polynomial.
– There is an efficient learning algorithm.

Theoretical results apply to fairly simple learning models (e.g., decision list learning)
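
A minimal sketch of the sample-complexity bound N ≥ (1/ε)(ln(1/δ) + ln |H|), evaluated for the Boolean-function hypothesis space discussed below (|H| = 2^(2^n)):

```python
import math

def sample_complexity(epsilon, delta, ln_hypothesis_space_size):
    # N >= (1/epsilon) * (ln(1/delta) + ln|H|); ln|H| is passed in directly
    # so that huge hypothesis spaces do not overflow.
    return (1.0 / epsilon) * (math.log(1.0 / delta) + ln_hypothesis_space_size)

# All Boolean functions on n attributes: |H| = 2^(2^n), so ln|H| = (2^n) * ln 2.
for n in (5, 10, 20):
    ln_H = (2 ** n) * math.log(2)
    n_examples = sample_complexity(epsilon=0.1, delta=0.05, ln_hypothesis_space_size=ln_H)
    print(n, round(n_examples))  # grows roughly as 2^n
```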

PAC Learning

Two steps:

1. Sample complexity: a polynomial number of examples suffices to specify a good consistent
hypothesis (error(h) ≤ ε) with high probability (1 − δ).

2. Computational complexity: there is an efficient algorithm for learning a consistent hypothesis
from the small sample.

Let's be more specific with examples.
Example: Boolean Functions

Consider H, the set of all Boolean functions on n attributes: |H| = 2^(2^n)

N ≥ (1/ε) (ln(1/δ) + ln |H|) = O(2^n)

So the sample complexity grows as 2^n
(the same as the number of all possible examples):
not PAC-learnable!
So any learning algorithm will do no better than a lookup table
if it merely returns a hypothesis that is consistent with all known
examples!

Intuitively, what does this say about H?
About learning in general?
Coping With Learning Complexity

1. Force the learning algorithm to look for the smallest consistent hypothesis.

We considered this for decision tree learning; it is often worst-case
intractable, though.

2. Restrict the size of the hypothesis space.

e.g., decision lists: a restricted form of Boolean functions.
Hypotheses correspond to a series of tests, each of which is a
conjunction of literals.

Good news: only a polynomial number of examples
is required to guarantee PAC learning of k-DL functions,
and there are efficient algorithms for learning k-DL.
