Combined ML
LEARNING
Humans learn about their surroundings through five senses: eye, ear, nose, tongue and skin.
LEARNING
1. Rote Learning (memorization): Memorizing things without
knowing the concept or logic behind them.
2. Passive Learning (instructions): Learning from a
teacher/expert.
3. Analogy (experience): Learning new things from our past
experience.
4. Inductive Learning (experience): On the basis of past
experience, formulating a generalized concept.
5. Deductive Learning: Deriving new facts from past facts.
[email protected] 4
CONCEPT LEARNING
Inducing general functions from specific training examples is a
main issue of machine learning.
[email protected] 6
CONCEPT LEARNING
A Formal Definition for Concept Learning:
Inferring a Boolean-valued function from training examples of
its input and output.
• An example for concept-learning is the learning of bird-concept
from the given examples of birds (positive examples) and non-
birds (negative examples).
• We are trying to learn the definition of a concept from given
examples.
SPORT EXAMPLE
Concept to be learned:
Days in which Aldo can enjoy water sport
Attributes:
Sky: Sunny, Cloudy, Rainy
AirTemp: Warm, Cold
Humidity: Normal, High
Wind: Strong, Weak
Water: Warm, Cool
Forecast: Same, Change
Instances in the training set:
(out of the 96 possible):
[email protected] 8
HYPOTHESES REPRESENTATION
h is a set of constraints on attributes:
a specific value: e.g. Water = Warm
any value allowed: e.g. Water = ?
no value allowed: e.g. Water = Ø
Example hypothesis:
⟨Sunny, ?, ?, Strong, ?, Same⟩  (attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast)
Corresponding to the Boolean function:
Sunny(Sky) ∧ Strong(Wind) ∧ Same(Forecast)
H, the hypothesis space, is the set of all possible h.
[email protected] 9
HYPOTHESIS SATISFACTION
An instance x satisfies an hypothesis h iff all the constraints expressed by h are
satisfied by the attribute values in x.
Example 1:
x1: Sunny, Warm, Normal, Strong, Warm, Same
h1: Sunny, ?, ?, Strong, ?, Same Satisfies?
Yes
Example 2:
x2: Sunny, Warm, Normal, Strong, Warm, Same
h2: Sunny, ?, ?, Ø, ?, Same Satisfies?
No
[email protected] 10
FORMAL TASK DESCRIPTION
Given:
X: all possible days, as described by the attributes
A set of hypotheses H, each a conjunction of constraints on the attributes, representing a function h: X → {0, 1}
[h(x) = 1 if x satisfies h; h(x) = 0 if x does not satisfy h]
A target concept c: X → {0, 1}, where
c(x) = 1 iff EnjoySport = Yes;
c(x) = 0 iff EnjoySport = No;
A training set of possible instances D: {x, c(x)}
Goal: find a hypothesis h in H such that
h(x) = c(x) for all x in D
Hopefully h will be able to predict outside D…
[email protected] 11
THE INDUCTIVE LEARNING ASSUMPTION
We can at best guarantee that the output hypothesis fits the target concept over the training data.
Assumption: a hypothesis that approximates the target function well over the training data will also approximate it well over unobserved examples,
i.e. given a sufficiently large training set, the output hypothesis is able to make predictions.
CONCEPT LEARNING AS SEARCH
Concept learning is the task of searching a hypothesis space.
The representation chosen for hypotheses determines the search
space
In the example we have:
3 × 2⁵ = 96 possible instances (6 attributes)
1 + 4 × 3⁵ = 973 semantically distinct hypotheses,
considering that all hypotheses containing at least one Ø constraint are semantically equivalent: they classify every instance as negative.
Structuring the search space may help in searching more efficiently.
GENERAL TO SPECIFIC ORDERING
Consider: h1 = Sunny, ?, ?, Strong, ?, ?
h2 = Sunny, ?, ?, ?, ?, ?
Any instance classified positive by h1 will also be classified
positive by h2
h2 is more general than h1
Definition: hj ≥g hk iff ∀x ∈ X [(hk(x) = 1) → (hj(x) = 1)]
≥g: more general than or equal to; >g: strictly more general than
Most general hypothesis: ⟨?, ?, ?, ?, ?, ?⟩
Most specific hypothesis: ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
GENERAL TO SPECIFIC ORDERING: INDUCED STRUCTURE
[email protected] 15
FIND-S: FINDING THE MOST SPECIFIC HYPOTHESIS
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
   for each attribute constraint a in h:
      if the constraint a is satisfied by x, then do nothing;
      else replace a in h by the next more general constraint that is satisfied by x (move towards a more general hypothesis)
3. Output hypothesis h
[email protected] 16
FIND-S IN ACTION
[email protected] 17
EXAMPLE
To illustrate this algorithm, assume the learner is given the sequence of training
examples from the EnjoySport task
[email protected] 18
PROPERTIES OF FIND-S
Find-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
The final hypothesis will also be consistent with the negative examples
Problems:
There can be more than one "most specific" hypothesis.
We cannot tell whether the learner has converged to the correct target concept.
Why choose the most specific hypothesis?
If the training examples are inconsistent, the algorithm can be misled: no tolerance to noise.
Negative examples are not considered.
QUESTION
Consider the following data set having the data about which particular seeds are
poisonous.
[email protected] 20
EXAMPLE
First we initialize the hypothesis to the most specific hypothesis:
h = {ϕ, ϕ, ϕ, ϕ}
Instance 1:
The data in example 1 is {GREEN, HARD, NO, WRINKLED}. Our initial hypothesis is more specific than this (positive) example, so we generalize it. Hence, the hypothesis becomes:
h = {GREEN, HARD, NO, WRINKLED}
EXAMPLE
Instance 3:
This example has a negative outcome, so Find-S ignores it and the hypothesis remains the same.
THANKS
[email protected] 24
MACHINE LEARNING (ML-5)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
[email protected] 1
AGENDA
Linear regression with one variable
[email protected] 2
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 WHAT?
[email protected] 3
QUIZ
X Y
2 4
3 9
5 25
9 81
7 49
11 121
10.5 110.25
[email protected] 4
HOW DO YOU FIND THAT?
HOW DO YOU FIND THAT?
You find the relation between X and Y:
Y = X·X = X², i.e. Y = f(X)
Which one is the dependent variable? Answer: Y
So what is X? X is the independent variable.
QUIZ
X Y
2 3
3 5
5 9
9 10
7 6.5
11 11.8
10.5 WHAT?
[email protected] 11
For X = 10.5, what is Y? Is it difficult to find out the relation?
GRAPH IS THE SOLUTION?
[Scatter plot of the (X, Y) data points; X axis 0–12, Y axis 0–14.]
APPROXIMATION
[Scatter plot of the points (2, 3), (3, 5), (5, 9), (7, 6.5), (9, 10), (11, 11.8) with a straight line approximating the trend.]
FIND THE EQUATION OF THE LINE
Two points are given: (3, 5) and (9, 10). Find the equation of the line.
Y = m·X + c, with m = (10 − 5)/(9 − 3) ≈ 0.83 and c = 5 − 0.83 × 3 = 2.5
Y ≈ 0.83X + 2.5
DEFINITION
Finding the relation between the dependent variable and the independent variable is called Linear Regression,
or equivalently: finding the best-fit line between the dependent variable and the independent variable is called Linear Regression.
[email protected] 18
DEFINITION
Finding the relation between dependent variable and Independent variable is called
Linear Regression.
[email protected] 19
DEFINITION
Finding the relation between dependent variable and Independent variable is called
Linear Regression.
Now, for X = 10.5, 12, 13, 2.5: Y = what?
What are you doing here? (USES OF LINEAR REGRESSION)
FORECASTING
PREDICTION
THE ERROR (RESIDUALS)
ERROR = GIVEN (ACTUAL) DATA − PREDICTED DATA
e = Y(actual) − Y(predicted)
Question: how can we find the best-fit line?
Answer: ideally Y(actual) = Y(predicted), i.e. e = 0; otherwise, minimise the error.
HOW TO FIND BEST FIT LINE
Derivation of the linear regression equations (for a single variable):
given a set of n points (xᵢ, yᵢ) on a scatterplot,
find the best-fit line ŷ = a + b·x
such that the sum of squared errors in Y, ∑(yᵢ − ŷᵢ)², is minimized.
Linear regression with one variable
MODEL REPRESENTATION *
SRC : * Andrew NG
[email protected] 25
[Scatter plot — Housing Prices (Portland, OR): Price (in 1000s of dollars, 0–500) vs. Size (feet², 0–3000).]
Supervised Learning: the "right answer" is given for each example in the data. Regression Problem: predict a real-valued output.
Training set of housing prices (Portland, OR):
Size in feet² (x) | Price ($) in 1000's (y)
2104 | 460
1416 | 232
1534 | 315
852  | 178
…    | …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
Training Set → Learning Algorithm → h; Size of house (x) → h → Estimated price (y). How do we represent h?
SRC : * Andrew NG
Training Set: as above (Size in feet² (x) → Price ($) in 1000's (y)).
Hypothesis: hθ(x) = θ₀ + θ₁x
θ₀, θ₁: Parameters
How to choose θ₀, θ₁?
[Three example plots of hθ(x) = θ₀ + θ₁x for different choices of the parameters θ₀ and θ₁.]
Linear regression with one variable
COST FUNCTION
INTUITION I *
SRC : * Andrew NG
[email protected] 33
Simplified
Hypothesis: hθ(x) = θ₁x
Parameters: θ₁
Cost Function: J(θ₁) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ₁)
(for fixed θ₁, hθ(x) is a function of x)   (J(θ₁) is a function of the parameter θ₁)
[Paired plots for several values of θ₁: left, hθ(x) through the training points; right, the corresponding point on the J(θ₁) curve.]
Linear regression with one variable
COST FUNCTION
INTUITION II *
SRC : * Andrew NG
[email protected] 38
Hypothesis: hθ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost Function: J(θ₀, θ₁) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁)
(for fixed θ₀, θ₁, hθ(x) is a function of x)   (J(θ₀, θ₁) is a function of the parameters θ₀, θ₁)
[Paired plots: left, hθ(x) over the housing data (Price vs. Size in feet²); right, surface and contour plots of J(θ₀, θ₁).]
Linear regression with one variable
GRADIENT DESCENT *
SRC : * Andrew NG
[email protected] 46
Have some function J(θ₀, θ₁)
Want min over θ₀, θ₁ of J(θ₀, θ₁)
Outline:
• Start with some θ₀, θ₁
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum
[3-D surface plots of J(θ₀, θ₁).]
Gradient descent algorithm:
repeat until convergence { θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)   (for j = 0 and j = 1, updated simultaneously) }, where α is the learning rate.
Linear regression with one variable
GRADIENT DESCENT
INTUITION*
SRC : * Andrew NG
[email protected] 51
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Linear regression with one variable
GRADIENT DESCENT FOR
LINEAR REGRESSION*
SRC : * Andrew NG
[email protected] 55
Gradient descent algorithm: repeat { θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁) }
Linear Regression Model: hθ(x) = θ₀ + θ₁x,  J(θ₀, θ₁) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Gradient descent algorithm (for linear regression): repeat {
θ₀ := θ₀ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i))
θ₁ := θ₁ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i)) · x^(i)
} updating θ₀ and θ₁ simultaneously
[Contour plots of J(θ₀, θ₁).]
(for fixed θ₀, θ₁, hθ(x) is a function of x)   (J(θ₀, θ₁) is a function of the parameters θ₀, θ₁)
[A sequence of plots: at each gradient descent step, the current line hθ(x) over the housing data and the corresponding point on the contour plot of J(θ₀, θ₁), moving toward the minimum.]
"Batch" Gradient Descent: each step of gradient descent uses all m training examples.
THANKS
[email protected] 72
MACHINE LEARNING (ML-6)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
[email protected] 1
AGENDA
Linear regression with multiple variable
Gradient descent for multiple variables
Gradient descent in practice I: Feature Scaling
SRC : * Andrew NG
[email protected] 2
Multiple features (variables).
Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852  | 2 | 1 | 36 | 178
…
Notation:
n = number of features
x^(i) = input (features) of the iᵗʰ training example
x_j^(i) = value of feature j in the iᵗʰ training example
Previously: hθ(x) = θ₀ + θ₁x
Hypothesis (multiple features): hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
For convenience of notation, define x₀ = 1, so that hθ(x) = θᵀx.
SRC : * Andrew NG
Hypothesis: hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + … + θₙxₙ
Parameters: θ₀, θ₁, …, θₙ
Cost function: J(θ) = (1/2m) ∑ᵢ (hθ(x^(i)) − y^(i))²
Gradient descent: repeat { θⱼ := θⱼ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i)) · x_j^(i) } (simultaneously update θⱼ for j = 0, …, n)
Linear Regression with multiple variables
SRC : * Andrew NG
[email protected] 11
Feature Scaling
Idea: make sure features are on a similar scale, e.g. get every feature into approximately a −1 ≤ xᵢ ≤ 1 range.
E.g. x₁ = size (0–2000 feet²), x₂ = number of bedrooms (1–5); dividing each feature by its range gives x₁ = size (feet²) / 2000 and x₂ = number of bedrooms / 5.
Mean normalization
Replace xᵢ with (xᵢ − μᵢ) / sᵢ to make features have approximately zero mean (do not apply to x₀ = 1), where μᵢ is the mean and sᵢ the range (or standard deviation) of feature i.
E.g. x₁ = (size − average size) / range of sizes.
Linear Regression with multiple variables
SRC : * Andrew NG
[email protected] 15
Gradient descent
[email protected] 16
Making sure gradient descent is working correctly:
plot J(θ) against the number of iterations; J(θ) should decrease after every iteration.
Example automatic convergence test: declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
[Plot: J(θ) vs. number of iterations (0–400).]
Making sure gradient descent is working correctly:
if J(θ) is increasing (or oscillating), gradient descent is not working — use a smaller α.
[Plot: J(θ) rising with the number of iterations.]
To choose α, try a range of values spaced by roughly a factor of 3–10, e.g. 0.001, 0.01, 0.1, 1, …
Linear Regression with multiple variables
[email protected] 20
Housing prices prediction
[email protected] 21
Polynomial regression
[Plot: Price (y) vs. Size (x), fitted with polynomial curves.]
E.g. hθ(x) = θ₀ + θ₁(size) + θ₂(size)², or with a cubic term θ₃(size)³.
Choice of features
[Plot: Price (y) vs. Size (x); different choices of features give different fits.]
Linear Regression with multiple variables
NORMAL EQUATION
[email protected] 24
Gradient Descent minimizes J(θ) iteratively; the Normal Equation solves for θ analytically.
Intuition: if θ is 1-D, minimize J(θ) by setting dJ/dθ = 0. For θ ∈ ℝⁿ⁺¹, set ∂J/∂θⱼ = 0 (for every j) and solve for θ₀, θ₁, …, θₙ.
This gives the normal equation: θ = (XᵀX)⁻¹Xᵀy.
Examples (with x₀ = 1 added):
x₀ | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 |  852 | 2 | 1 | 36 | 178
1 | 3000 | 4 | 1 | 38 | 540
[email protected] 29
THANKS
[email protected] 30
MACHINE LEARNING (ML-7)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Logistic regression (Classification)
SRC : * Andrew NG
[email protected] 2
Classification
Example: Malignant? (Yes) 1 / (No) 0 vs. Tumor Size.
[Plots: tumors labelled 0/1 against tumor size, with a threshold applied to a fitted curve.]
Logistic Regression: 0 ≤ hθ(x) ≤ 1
Logistic Regression
HYPOTHESIS
REPRESENTATION
Logistic Regression Model
Want 0 ≤ hθ(x) ≤ 1.
hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e⁻ᶻ) is the sigmoid (logistic) function.
[Plot: g(z) increases from 0, passes through 0.5 at z = 0, and approaches 1.]
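A minimal sketch of the sigmoid hypothesis (assuming NumPy; names are illustrative).

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic regression hypothesis: estimated probability that y = 1 given x."""
    return sigmoid(np.dot(theta, x))

print(sigmoid(0.0))   # 0.5: the 0.5 threshold corresponds to theta^T x = 0
```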
Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 on input x, i.e. hθ(x) = P(y = 1 | x; θ).
Example: if hθ(x) = 0.7, the input has an estimated 70% chance of belonging to class y = 1.
Suppose we predict "y = 1" if hθ(x) ≥ 0.5, i.e. θᵀx ≥ 0,
and predict "y = 0" if hθ(x) < 0.5, i.e. θᵀx < 0.
Decision Boundary
[Plot: two classes in the (x₁, x₂) plane separated by a straight line.]
With hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂), predict "y = 1" if θ₀ + θ₁x₁ + θ₂x₂ ≥ 0; the line θᵀx = 0 is the decision boundary.
Non-linear decision boundaries
[Plot: two classes in the (x₁, x₂) plane separated by a circle around the origin.]
Adding polynomial features, e.g. hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²), allows non-linear boundaries: predict "y = 1" where the argument is ≥ 0 (e.g. x₁² + x₂² ≥ 1 gives a circular boundary).
Logistic Regression
COST FUNCTION
Training set: {(x^(1), y^(1)), …, (x^(m), y^(m))}, m examples, y ∈ {0, 1}.
Using the squared-error cost with the sigmoid hypothesis would make J(θ) "non-convex"; the logistic cost below is "convex".
Logistic regression cost function:
Cost(hθ(x), y) = −log(hθ(x)) if y = 1;  −log(1 − hθ(x)) if y = 0
[Plots: the cost as hθ(x) ranges over (0, 1), for y = 1 and for y = 0.]
Logistic Regression
SIMPLIFIED COST FUNCTION AND
GRADIENT DESCENT
Logistic regression cost function:
J(θ) = −(1/m) ∑ᵢ [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
To fit parameters θ: want min over θ of J(θ).
Gradient descent: repeat { θⱼ := θⱼ − α (1/m) ∑ᵢ (hθ(x^(i)) − y^(i)) x_j^(i) } (simultaneously update all θⱼ)
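A sketch of this cost and one gradient step, assuming NumPy, a design matrix X with a leading column of ones, and y ∈ {0, 1}; the learning rate is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def gradient_step(theta, X, y, alpha=0.1):
    """One update; same form as linear regression but with the sigmoid hypothesis."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    return theta - alpha * grad
```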
Multiclass classification: One-vs-all (one-vs-rest)
[Plots: three classes in the (x₁, x₂) plane; for each class i a separate binary problem is formed (class i vs. the rest), giving a classifier for Class 1, Class 2 and Class 3.]
One-vs-all
Train a logistic regression classifier for each class i to predict the probability that y = i.
On a new input x, pick the class i whose classifier outputs the highest probability.
MACHINE LEARNING (ML-8)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
Machine Learning: Training, Testing, Evaluation
SRC : * Andrew NG
[email protected] 2
EVALUATING THE HYPOTHESIS
A hypothesis may fit the training set well yet fail to generalize to new examples not in the training set (overfitting).
A common check: split the data into a training set and a test set (e.g. 70% / 30%), learn on the former and evaluate on the latter.
TRAINING/TESTING PROCEDURE FOR LINEAR
REGRESSION
Learn parameter θ from the training data (by minimizing the training error J_train(θ)).
Compute the test set error: J_test(θ) = (1/2m_test) ∑ᵢ (hθ(x_test^(i)) − y_test^(i))²
TRAINING/TESTING PROCEDURE FOR LOGISTIC
REGRESSION
Learn parameter θ from the training data (by minimizing the training error J_train(θ)).
Compute the test set error: J_test(θ) = −(1/m_test) ∑ᵢ [ y_test^(i) log hθ(x_test^(i)) + (1 − y_test^(i)) log(1 − hθ(x_test^(i))) ],
or alternatively the misclassification (0/1) error on the test set.
PYTHON IMPLEMENTATION
EVALUATION METRICS*
Training objective (cost function) is only a proxy for real world objectives.
Metrics are useful and important for evaluation.
Metrics help capture a business goal into a quantitative target.
Helps organize ML team effort towards that target. Generally in the form of
improving that metric on the dev set.
Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).
[email protected] 11
BINARY CLASSIFICATION
x is input
y is binary output (0/1)
Model is ŷ= h(x)
Two types of models
Models that output a categorical class directly (K-nearest neighbor, Decision
tree)
Models that output a real valued score (SVM, Logistic Regression)
Score could be a margin (SVM) or a probability (Logistic Regression); a threshold must be picked to turn the score into a class.
We focus on score-based models (class-output models can be treated as a special case).
SCORE BASED MODELS
THRESHOLD -> CLASSIFIER -> POINT METRICS
POINT METRICS: CONFUSION MATRIX
POINT METRICS: TRUE POSITIVES
POINT METRICS: TRUE NEGATIVES
POINT METRICS: FALSE POSITIVES
POINT METRICS: FALSE NEGATIVES
FP AND FN ALSO CALLED TYPE-1 AND TYPE-2 ERRORS
POINT METRICS: ACCURACY
POINT METRICS: PRECISION
POINT METRICS: POSITIVE RECALL (SENSITIVITY)
POINT METRICS: NEGATIVE RECALL (SPECIFICITY)
POINT METRICS: F1-SCORE
POINT METRICS: CHANGING THRESHOLD
SUMMARY METRICS: PRC (RECALL VS. PRECISION)
ROC (RECEIVER OPERATING CHARACTERISTICS)
•ROC curve is a performance measurement for classification problem at various
thresholds settings.
•ROC is a probability curve and AUC represents degree or measure of separability.
•It tells how much model is capable of distinguishing between classes.
•Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s.
•By analogy, Higher the AUC, better the model is at distinguishing between patients
with disease and no disease.
[email protected] 29
ROC CURVE
•The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is
on the x-axis.
[email protected] 30
DEFINING TERMS USED IN AUC AND ROC CURVE
TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN); FPR (False Positive Rate) = FP / (FP + TN) = 1 − Specificity.
HOW TO SPECULATE THE PERFORMANCE OF THE
MODEL?
•An excellent model has AUC near to the 1 which means it has good measure of
separability.
•A poor model has AUC near to the 0 which means it has worst measure of
separability. In fact it means it is reciprocating the result. It is predicting 0s as 1s and
1s as 0s.
•AUC is 0.5, it means model has no class separation capacity whatsoever.
[email protected] 32
INTERPRETATION OF ROC CURVE
As we know, ROC is a curve of probability, so let's plot the distributions of those probabilities:
the red distribution curve is of the positive class (patients with disease) and the green distribution curve is of the negative class (patients with no disease).
[email protected] 33
INTERPRETATION OF ROC CURVE
[email protected] 34
INTERPRETATION OF ROC CURVE
[email protected] 35
INTERPRETATION OF ROC CURVE
[email protected] 36
COMPARING ROC CURVES
[email protected] 37
RELATION BETWEEN SENSITIVITY, SPECIFICITY,
FPR AND THRESHOLD
•Sensitivity and Specificity are inversely related: when we increase Sensitivity, Specificity decreases, and vice versa.
•When we decrease the threshold, we get more positive predictions, which increases sensitivity and decreases specificity.
•Similarly, when we increase the threshold, we get more negative predictions, so we get higher specificity and lower sensitivity.
EXAMPLE: ROC
[email protected] 39
THANKS
[email protected] 40
MACHINE LEARNING (ML-9)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
[email protected] 1
AGENDA
Bias Vs Variance
[email protected] 2
[email protected] 3
[email protected] 4
[email protected] 5
[email protected] 6
[email protected] 7
[email protected] 8
[email protected] 9
[email protected] 10
[email protected] 11
[email protected] 12
[email protected] 13
[email protected] 14
[email protected] 15
[email protected] 16
[email protected] 17
[email protected] 18
[email protected] 19
[email protected] 20
[email protected] 21
[email protected] 22
[email protected] 23
[email protected] 24
[email protected] 25
[email protected] 26
[email protected] 27
[email protected] 28
[email protected] 29
[email protected] 30
[email protected] 31
[email protected] 32
[email protected] 33
[email protected] 34
[email protected] 35
[email protected] 36
[email protected] 37
[email protected] 38
[email protected] 39
[email protected] 40
[email protected] 41
[email protected] 42
[email protected] 43
[email protected] 44
[email protected] 45
[email protected] 46
[email protected] 47
TRADE-OFF (BIAS VS VARIANCE)
[email protected] 48
THANKS
[email protected] 49
MACHINE LEARNING (ML-10)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
K Nearest Neighbor
[email protected] 2
K-NEAREST-NEIGHBORS ALGORITHM
K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)
If K=1, then the case is simply assigned to the class of its nearest
neighbor
DISTANCE FUNCTION MEASUREMENTS
HAMMING DISTANCE
For category variables, Hamming distance can be used.
K-NEAREST-NEIGHBORS
WHAT IS THE MOST PROBABLE LABEL FOR c?
[Figure: a query point c among points labelled "a" and "o".]
Solution: look for the K nearest neighbors of c and take the majority label as c's label.
Suppose k = 3: the 3 nearest points to c are a, a and o. Therefore, the most probable label for c is a.
PSEUDO CODE OF KNN
1. Load the data
2. Initialise the value of k
3. To get the predicted class for a test point:
   1. Calculate the distance between the test point and each row of the training data (here we use Euclidean distance, the most popular metric).
   2. Sort the calculated distances in ascending order.
   3. Take the top k rows from the sorted array.
   4. Take the most frequent class among these rows.
   5. Return the predicted class.
(A Python sketch of this pseudocode follows below.)
REMARKS
CHOOSING THE MOST SUITABLE K
NORMALIZATION
K-NEAREST NEIGHBOR CLASSIFICATION (KNN)
Unlike all the previous learning methods, kNN does not build model
from the training data.
To classify a test instance d, define k-neighborhood P as k nearest
neighbors of d
Count number n of training instances in P that belong to class cj
Estimate Pr(cj|d) as n/k
No training is needed. Classification time is linear in training set size for
each test case.
19
DISCUSSIONS
kNN can deal with complex and arbitrary decision boundaries.
Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and, in many cases, as accurate as more elaborate methods.
kNN is slow at classification time.
kNN does not produce an understandable model.
EXERCISE
[email protected] 21
EXERCISE
[email protected] 22
EXERCISE
Suppose, you have given the following data where x and y are the 2 input variables
and Class is the dependent variable.
Q1. Suppose you want to predict the class of the new data point x = 1, y = 1 using Euclidean distance with 3-NN. To which class does this data point belong?
EXERCISE
Q2. In the previous question, if you now use 7-NN instead of 3-NN, to which class will the point x = 1, y = 1 belong?
Q3. And with 5-NN instead of 3-NN, to which class will the point x = 1, y = 1 belong?
THANKS
[email protected] 25
MACHINE LEARNING (ML-11)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oNaïve Bayes Classifier
[email protected] 2
WHAT IS NAIVE BAYES ALGORITHM?
• It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors.
• A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
• Naive Bayes model is easy to build and particularly useful for very large data sets.
• Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
[email protected] 3
PREREQUISITES FOR BAYES’ THEOREM
What is an Experiment?
“An experiment is a planned operation carried out under controlled conditions.”
Tossing a coin, rolling a die, and drawing a card out of a well-shuffled pack of cards are all
examples of experiments.
[email protected] 4
SAMPLE SPACE
The result of an experiment is called an outcome. The set of all possible outcomes of
an event is called the sample space.
For example, if our experiment is throwing dice and recording its outcome, the sample
space will be:
S1 = {1, 2, 3, 4, 5, 6}
What will be the sample when we’re tossing a coin?
S2 = {H, T}
[email protected] 5
EVENT
An event is a set of outcomes (i.e. a subset of the sample space) of an experiment.
Let’s get back to the experiment of rolling a dice and define events E and F as:
E = An even number is obtained = {2, 4, 6}
F = A number greater than 3 is obtained = {4, 5, 6}
The probability of these events:
[email protected] 7
RANDOM VARIABLE
Let’s take a simple example (refer to the above image as we go along). Define a
random variable X on the sample space of the experiment of tossing a coin. It takes a
value +1 if “Heads” is obtained and -1 if “Tails” is obtained. Then, X takes on values
+1 and -1 with equal probability of 1/2.
Consider that Y is the observed temperature (in Celsius) of a given place on a given
day. So, we can say that Y is a continuous random variable defined on the same space,
S = [0, 100] (Celsius Scale is defined from zero degree Celsius to 100 degrees
Celsius).
[email protected] 8
EXHAUSTIVE EVENTS
A set of events is said to be exhaustive if at least one of the events must occur at
any time. Thus, two events A and B are said to be exhaustive if A ∪ B = S, the
sample space.
For example, let’s say that A is the event that a card drawn out of a pack is red and
B is the event that the card drawn is black. Here, A and B are exhaustive because
the sample space S = {red, black}. Pretty straightforward stuff, right?
[email protected] 9
INDEPENDENT EVENTS
If the occurrence of one event does not have any effect on the occurrence of
another, then the two events are said to be independent. Mathematically, two events
A and B are said to be independent if: P(A ∩ B) = P(A) · P(B).
CONDITIONAL PROBABILITY
Consider that we’re drawing a card from a given deck.
What is the probability that it is a black card?
That’s easy – 1/2, right?
However, what if we know it was a black card – then what would be the probability
that it was a king?
This is where the concept of conditional probability comes into play.
Conditional probability is defined as the probability of an event A, given that
another event B has already occurred (i.e. A conditional B). This is represented by
P(A|B) and we can define it as:
P(A|B) = P(A ∩ B) / P(B)
[email protected] 11
CONDITIONAL PROBABILITY
Let event A represent picking a king and event B picking a black card. Then we find
P(A|B) using the above formula: P(A|B) = P(A ∩ B) / P(B) = (2/52) / (26/52) = 1/13.
WHAT IS BAYES' THEOREM?
Bayes' Theorem: P(A|B) = P(B|A) · P(A) / P(B) — it lets us update the probability of A after observing B.
WHAT IS BAYES’ THEOREM?
“Have you ever seen the popular TV show ‘Sherlock’
(or any crime thriller show)? Think about it – our
beliefs about the culprit change throughout the
episode. We process new evidence and refine our
hypothesis at each step.
[email protected] 16
AN ILLUSTRATION OF BAYES’ THEOREM
Let’s solve a problem using Bayes’ Theorem. This will help you understand and
visualize where you can apply it.
There are 3 boxes labeled A, B, and C:
Box A contains 2 red and 3 black balls
Box B contains 3 red and 1 black ball
And box C contains 1 red ball and 4 black balls
The three boxes are identical and have an equal probability of getting picked.
Consider that a red ball is chosen. Then what is the probability that this red ball was
picked out of box A?
[email protected] 17
CONTD…
We have prior probabilities P(A) = P(B) = P (C) = 1 / 3, since all boxes have equal
probability of getting picked.
[email protected] 19
NAIVE BAYES’ CLASSIFIERS
Naive Bayes’ Classifiers are a set of probabilistic classifiers based on
the Bayes’ Theorem. The underlying assumption of these classifiers is
that all the features used for classification are independent of each
other.
That’s where the name ‘naive’ comes in since it is rare that we obtain a
set of totally independent features.
The way these classifiers work is exactly how we solved in the
illustration, just with a lot more features assumed to be independent of
each other.
[email protected] 20
NAIVE BAYES’ CLASSIFIERS
Here, we need to find the probability P(Y|X), where X is an n-dimensional random variable whose component random variables X_1, X_2, …, X_n are assumed independent of each other given Y:
P(Y | X_1, …, X_n) ∝ P(Y) · ∏ᵢ P(X_i | Y)
ADVANTAGES & DISADVANTAGES
Advantages of Naïve Bayes Classifier:
•Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
•It can be used for Binary as well as Multi-class Classifications.
•It performs well in Multi-class predictions as compared to the other Algorithms.
•It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
•Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
[email protected] 25
APPLICATIONS OF NAÏVE BAYES CLASSIFIER:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.
[email protected] 26
TYPES OF NAÏVE BAYES MODEL
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
[email protected] 27
TYPES OF NAÏVE BAYES MODEL
Multinomial: The Multinomial Naïve Bayes classifier is used when the data are multinomially distributed. It is primarily used for document classification problems, i.e. deciding which category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
TYPES OF NAÏVE BAYES MODEL
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also popular for document classification tasks.
BernoulliNB implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features, but each is assumed to be a binary-valued (Bernoulli, Boolean) variable.
THANKS
[email protected] 30
MACHINE LEARNING (ML-12)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oDecision Tree
[email protected] 2
DECISION TREE
[email protected] 3
DECISION TREE
[email protected] 4
ILLUSTRATING CLASSIFICATION TASK
Training set (→ Learn Model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large  | 125K | No
2 | No  | Medium | 100K | No
3 | No  | Small  | 70K  | No
4 | Yes | Medium | 120K | No
5 | No  | Large  | 95K  | Yes
6 | No  | Medium | 60K  | No
7 | Yes | Large  | 220K | No
8 | No  | Small  | 85K  | Yes
9 | No  | Medium | 75K  | No
10 | No | Small  | 90K  | Yes

Test set (→ Apply Model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No  | Small  | 55K  | ?
12 | Yes | Medium | 80K  | ?
13 | Yes | Large  | 110K | ?
14 | No  | Small  | 95K  | ?
15 | No  | Large  | 67K  | ?
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant
[email protected] 6
EXAMPLE OF A DECISION TREE
[Figure: a decision tree over the attributes Refund, Marital Status and Taxable Income predicting Cheat; the internal nodes are the splitting attributes.]
DECISION TREE CLASSIFICATION TASK
APPLY MODEL TO TEST DATA
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
Refund? — Yes → NO; No → MarSt
MarSt? — Single, Divorced → TaxInc; Married → NO
TaxInc? — < 80K → NO; > 80K → YES
For this test record, the path Refund = No → MarSt = Married reaches the leaf NO, so we assign Cheat = "No".
DECISION TREE INDUCTION
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
[email protected] 17
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
[email protected] 18
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
[email protected] 19
HOW TO SPECIFY TEST CONDITION?
Depends on attribute types
Nominal
Ordinal
Continuous
[email protected] 20
SPLITTING BASED ON NOMINAL ATTRIBUTES
Multi-way split: use as many partitions as there are distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets, e.g. CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
SPLITTING BASED ON ORDINAL ATTRIBUTES
Multi-way split: use as many partitions as there are distinct values, e.g. Size → {Small}, {Medium}, {Large}.
Binary split: divide the values into two subsets that respect the order, e.g. Size → {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
A split such as Size → {Small, Large} vs. {Medium} does not preserve the ordering.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
HOW TO DETERMINE THE BEST SPLIT
Before Splitting: 10 records of class 0,
10 records of class 1
[email protected] 26
HOW TO DETERMINE THE BEST SPLIT
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
[email protected] 27
MEASURES OF NODE IMPURITY
Entropy
Gini Index
Misclassification error
[email protected] 28
ENTROPY
Entropy(t) = −∑ⱼ p(j|t) · log₂ p(j|t), where p(j|t) is the relative frequency of class j at node t.
(See https://en.wikipedia.org/wiki/Entropy_(information_theory))
[Figures: entropy computed for example nodes with "yes"/"no" class counts.]
HOW TO FIND THE BEST SPLIT
Before splitting, the node contains C0: N00 and C1: N01 records, with impurity M0.
Candidate split A? (Yes/No) produces children with impurities M1 and M2 (combined M12); candidate split B? produces children with impurities M3 and M4 (combined M34).
Gain = M0 − M12 vs. M0 − M34: choose the split with the larger gain.
MEASURE OF IMPURITY: GINI
Gini index for a given node t: GINI(t) = 1 − ∑ⱼ [p(j|t)]²
Maximum (1 − 1/n_c) when records are equally distributed among all n_c classes, implying the least interesting information.
Minimum (0.0) when all records belong to one class, implying the most interesting information.
C1=0, C2=6: Gini = 0.000 | C1=1, C2=5: Gini = 0.278 | C1=2, C2=4: Gini = 0.444 | C1=3, C2=3: Gini = 0.500
EXAMPLE
THANKS
[email protected] 69
MACHINE LEARNING (ML-13)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oData preprocessing
[email protected] 2
What is Data?
A collection of numbers assigned as values to quantitative variables and/or characters assigned as values to qualitative variables; or a collection of records and their attributes.
Ordinal
Used to rank order cases
Example: ranking (eg. movie on scale of 1-10), height (tall, medium, short), grades
Interval
Example: Calendar dates, longitude, latitude
Ratio
Same as interval variable but they have a “true zero”
Example: time, length, population, age
4
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, −
Multiplication: ×, /
Nominal: distinctness
Ordinal: distinctness, order
Interval: distinctness, order, addition
Ratio: all four properties
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countable infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
6
Type of data sets
Record Data
Data Matrix
Transaction data
Graph Data
World wide web
Molecular structure
Ordered
Spatial data
Temporal data
Sequential data
7
Record Data
Data that consists of a collection of records, each of which consists of fixed set
of attributes
8
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multidimensional space, where each
dimension represents a distinct attribute
Such data set can be represented by an m by n matrix, where there are m rows,
one for each object, and n columns, one for each attribute
9
Data Matrix Example for Documents
Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the
document.
10
Transaction Data
A typical type of record data, then
Each record (transaction) involves a set of items
11
Graph data
Example: Facebook graph and HTML links
12
Ordered data
Genetic sequence data
13
Data Quality
What kind of data quality problems?
How can we detect the problem with the data?
Missing values
Noise and outliers
Duplicate data
14
Data Quality: Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
15
Data Quality: Noise
Noise refers to modification of original values
Examples: distortion of a person's voice when talking on a poor connection, "snow" on a television screen.
Data Quality: Outliers
Outliers are data objects with characteristics that are considerably different
than most of the other data objects in the data set
17
Data Quality: Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one
another
Major issue when merging data from heterogenous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
18
Data Preprocessing
Imputation
Outlier management
Feature selection
19
Imputation (filling in) of missing data
Imputation is performed using a number of different algorithms, which can be
subdivided into single and multiple imputation methods.
Multiple-imputation methods
several likelihood- ordered choices for imputing the missing value are computed and one
“best” value is selected.
20
Imputation Contd…
Single imputation
Mean imputation
Hot deck imputation
Multiple imputation
21
Single imputation Contd…
Mean imputation
Mean imputation, also called unconditional mean
imputation, is a widely used imputation method
Mean imputation assumes that the mean of a
variable is the best estimate for any
case that has missing information on this variable
For continuous variable, each missing value is
imputed with the mean of known values for the
same variable
For a categorical variable, the missing values are imputed with the mode of the observed values of the same variable.
Advantages
fast,
simple,
Limitations
underestimation of the population variance
thus a small standard error
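A small pandas sketch of mean imputation (mean for a continuous column, mode for a categorical one); the column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 30, None, 40, None],
    "city": ["Agra", None, "Mathura", "Mathura", "Agra"],
})

df["age"] = df["age"].fillna(df["age"].mean())         # continuous: impute with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: impute with the mode
print(df)
```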
Single imputation Contd…
Hot-deck imputation: each missing value is replaced with an observed value taken from a similar ("donor") record.
Advantages
preserves the population distribution
it is better than mean imputation
Limitations
distort correlations and covariances
25
Other type of Single imputation
Regression imputation
Cold-deck imputation
Sequential imputation
26
Multiple imputation
The idea of Multiple Imputation is to replace each missing value with multiple
acceptable values that represent a distribution of possibilities.
This results in a number of complete datasets (usually 3-10):
27
Outlier management
Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
Ex.: Unusual credit card purchase
Outliers are different from the noise data
Noise is random error or variance in a measured variable
28
Types of Outliers
Three kinds:
Global,
Contextual
Collective
29
Types of Outliers Contd…
Contextual outlier (or conditional outlier)
An object is a contextual outlier if it deviates significantly with respect to a selected context of the object.
Ex. 48 °C in Mathura: outlier? (It depends on whether it is summer or winter.)
To detect contextual outliers, the attributes of the data objects should be divided into two groups:
Contextual attributes: define the context, e.g. time & location
Behavioral attributes: the characteristics being checked for deviation, e.g. temperature
Types of Outliers Contd…
Collective Outliers
A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers.
Application example, intrusion detection: a number of computers keep sending denial-of-service packets to each other.
Detection of collective outliers considers not only the behavior of individual objects but also that of groups of objects, and it needs background knowledge of the relationships among the objects.
Outlier-detection approaches include supervised, semi-supervised and unsupervised methods, and statistical, proximity-based and clustering-based methods.
Statistical Methods
Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model; data not following the model are outliers.
Methods are divided into two categories: parametric vs. non-parametric.
Parametric method: assumes that the normal data are generated by a parametric distribution with parameter θ; the probability density function f(x, θ) gives the probability that x is generated by that distribution.
Non-parametric method: does not assume an a-priori statistical model but determines the model from the input data; not completely parameter-free, but the number and nature of the parameters are flexible.
Parametric example: assume the data are generated from a normal distribution, use the maximum likelihood method to estimate μ and σ from the input data, and identify points with low probability as outliers.
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4} — under the estimated μ and σ, 24.0 has very low probability, so 24.0 is an outlier.
Statistical Methods – Box Plot
Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are outliers
Consider the following dataset:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Here,
Q2 (median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 − Q1 = 14.9 − 14.4 = 0.5
Outliers will be any points:
below Q1 − 1.5×IQR = 14.4 − 0.75 = 13.65, or
above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65
So, the outliers are 10.2, 15.9, and 16.4.
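A sketch reproducing the box-plot rule on this dataset with NumPy; NumPy's default quartile convention can differ slightly from the hand calculation, but the detected outliers are the same here.

```python
import numpy as np

data = np.array([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
                 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

q1, q3 = np.percentile(data, [25, 75])     # quartiles (interpolation convention may differ slightly)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # [10.2 15.9 16.4]
```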
Non-Parametric Methods: Detection Using Histogram
36
Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
Two types of proximity-based outlier detection methods
Distance-based outlier detection: An object o is an outlier if its neighborhood does not have
enough other points
Density-based outlier detection: An object o is an outlier if its density is relatively much
lower than that of its neighbors
37
Clustering-Based Outlier Detection
An object is an outlier if it does not belong to any cluster, if it is far from the cluster it is closest to, or if it belongs to a small or sparse cluster.
One-Hot Encoding
Label Encoding
Label encoding assigns each unique value to a different integer.
This approach assumes an ordering of the categories:
Eg: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)
This assumption makes sense in this example, because there is an indisputable ranking to the
categories. Not all categorical variables have a clear ordering in the values, but we refer to
those that do as ordinal variables.
For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables.
One-Hot Encoding Contd…
In contrast to label encoding, one-hot encoding does not assume an ordering of
the categories.
This approach to work particularly well if there is no clear ordering in the
categorical data (e.g., "Red" is neither more nor less than "Yellow").
We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable
takes on a large number of values (i.e., you generally won't use it for variables
taking more than 15 different values).
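A sketch of both encodings with pandas; the column and category names are illustrative (the ordinal categories reuse the "Never"…"Every day" example above).

```python
import pandas as pd

df = pd.DataFrame({"frequency": ["Never", "Rarely", "Most days", "Every day", "Rarely"],
                   "colour":    ["Red", "Yellow", "Red", "Green", "Yellow"]})

# Label encoding: map each ordinal category to an integer that respects its order
order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
df["frequency_encoded"] = df["frequency"].map(order)

# One-hot encoding: one binary column per nominal category, no ordering implied
df = pd.get_dummies(df, columns=["colour"])
print(df)
```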
Feature selection
Features contain information about the target
Naïve view:
More feature
=>More information
=>More discrimination power
In practice:
Many reasons why this is not the case !
43
Feature selection Contd…
Curse of dimensionality
44
Feature selection Contd…
Data may contain many irrelevant and redundant variables (features) and often
comparably few (limited) training examples
Feature selection
A procedure in machine learning to find a subset of features that produces
‘better’ model for given dataset
Avoid overfitting and achieve better generalization ability
Reduce the storage requirement and training time
Interpretability
45
Feature Selection
Given a set of N features, the goal of feature selection is to select a subset of K
features (K << N) in order to minimize the classification error.
46
Feature Selection vs Feature Extraction
Feature Selection
New features represent a subset of the original features.
When classifying novel patterns, only a small number of features need to be computed (i.e.,
faster classification).
Feature Extraction
Projection to M<N dimension
New features are combinations (linear for PCA/LDA) of the original features (difficult to
interpret).
When classifying novel patterns, all features need to be computed.
47
Feature Selection: Main Steps
Feature selection is an optimization
problem.
48
Feature Selection: Main Steps (cont’d)
Search methods
Exhaustive
Heuristic
Randomized
Evaluation methods
Filter (Unsupervised)
Look at input only
Select the subset that has the most information
Wrapper (Supervised)
Train using selected subset
Estimate error on validation dataset
49
Search Methods
Assuming n features, an exhaustive search would require examining all (n choose d) possible subsets of size d.
In practice, heuristics are used to speed up the search, but they cannot guarantee optimality.
Evaluation Methods
Filter: evaluation is independent of the classification algorithm.
The objective function evaluates the "goodness" of a feature subset by its information content (e.g. interclass distance, statistical dependence, or information-theoretic measures).
Evaluation Methods (cont'd)
Wrapper: evaluation uses criteria related to the classification algorithm.
The objective function measures "goodness" as the classifier's predictive accuracy, estimated on a validation set.
Filter vs Wrapper Methods (cont’d)
Filter Methods
Advantages
Much faster than wrapper methods since the objective function has lower computational
requirements.
The optimum feature set might work well with various classifiers as it is not tied to a
specific classifier.
Disadvantages
Achieve lower recognition accuracy compared to wrapper methods.
Have a tendency to select more features compared to wrapper methods.
53
Filter vs Wrapper Methods
Wrapper Methods
Advantages
Achieve higher recognition accuracy compared to filter methods since they use the classifier
itself in choosing the optimum set of features.
Disadvantages
Much slower compared to filter methods since the classifier must be trained and tested for
each candidate feature subset.
The optimum feature subset might not work well for other classifiers.
54
Forward selection (heuristic search)
Start with no features; at each step, estimate the classification/regression error of adding each candidate feature and add the one that improves the error the most.
Backward selection (BS)
(heuristic search)
56
THANKS
[email protected] 57
MACHINE LEARNING (ML-14)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oMachine Learning
oTypes of Machine Learning
oUnsupervised Learning : Clustering
oUnsupervised Example
oApplications of Clustering
oK-Means Algorithm
oK-Means Examples
[email protected] 2
WHEN DO WE USE MACHINE LEARNING?
ML is used when:
• Human expertise does not exist (navigating on Mars)
[email protected] 7
SUPERVISED VS UNSUPERVISED LEARNING
Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
These patterns are then utilized to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute; we explore the data to find intrinsic structure (e.g. groups/clusters) in them.
UNSUPERVISED EXAMPLE
A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and, based on this information, decide which offer should be given to which customer.
With a large number of customers this does not scale; instead, the bank can group (cluster) customers with similar details and design one offer per segment.
APPLICATIONS OF CLUSTERING IN REAL-WORLD
Image Segmentation
Recommendation Engines
[email protected] 12
K-MEANS CLUSTERING
[Figures: step-by-step illustration of K-means on 2-D points.]
K-MEANS ALGORITHM
1. Choose the number of clusters K and initialise K centroids.
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2–3 until the assignments (or centroids) no longer change.
EXAMPLE
Divide the given sample data in two (2) clusters using K-Means algorithm using
Euclidean Distance.
Sno. | Height (H) | Weight (W)
1 | 185 | 72
2 | 170 | 56
3 | 168 | 60
4 | 179 | 68
5 | 182 | 72
6 | 188 | 77
7 | 180 | 71
8 | 180 | 70
9 | 183 | 84
10 | 180 | 88
11 | 180 | 67
12 | 177 | 76
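A hedged scikit-learn sketch of the exercise above (K = 2 on the height/weight data); the initialization and random_state are assumptions, so the exact assignment may differ from a hand computation.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                 [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.labels_)            # cluster index assigned to each sample
print(km.cluster_centers_)   # final centroids (mean height/weight of each cluster)
```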
K-MEANS FOR NON-SEPARATED CLUSTERS
[email protected] 22
EXAMPLE
[email protected] 23
SOLUTION
[email protected] 24
SOLUTION
[email protected] 25
SOLUTION
[email protected] 26
SOLUTION
[email protected] 27
SOLUTION
[email protected] 28
SOLUTION
[email protected] 29
SOLUTION
[email protected] 30
SOLUTION
[email protected] 31
SOLUTION
[email protected] 32
SOLUTION
[email protected] 33
VISUALIZATION OF SOLUTION
[email protected] 34
VISUALIZATION OF SOLUTION
[email protected] 35
HOW TO CHOOSE K: ELBOW METHOD
It is based on the sum of squared distance (SSE) between data points and their
assigned clusters’ centroids.
[email protected] 36
DRAWBACKS
K-means algorithm is good in capturing structure of the data if clusters have
a spherical-like shape.
It always try to construct a nice spherical shape around the centroid.
The clusters which have a complicated geometric shapes, k-means does a
poor job in clustering the data.
[email protected] 37
CLUSTERING QUALITY
Ideal clustering is characterized by minimal intra cluster distance and maximal
inter cluster distance.
There are majorly two types of measures to assess the clustering performance.
(i) Extrinsic Measures which require ground truth labels.
Rand index
(ii) Intrinsic Measures that does not require ground truth labels.
Silhouette Coefficient
[email protected] 38
BASIC CLUSTERING METHODS
1. Partitioning methods
k-means
2. Hierarchical methods
Agglomerative (bottom-up) or divisive (top-down)
3. Density-based methods
4. Grid-based methods
[email protected] 39
OVERVIEW OF CLUSTERING METHODS
[email protected] 40
HIERARCHICAL METHODS
A hierarchical clustering method works by grouping data objects into a hierarchy or
“tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization
and visualization.
Agglomerative methods start with individual objects as clusters, which are iteratively
merged to form larger clusters.
Divisive methods initially let all the given objects form one cluster, which they
iteratively split into smaller clusters.
Hierarchical clustering methods can encounter difficulties regarding the selection of
merge or split points.
[email protected] 41
AGGLOMERATIVE VERSUS DIVISIVE HIERARCHICAL
CLUSTERING
A hierarchical clustering method can be either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up (merging) or top
down (splitting) fashion.
[email protected] 43
AGGLOMERATIVE HIERARCHICAL CLUSTERING
We assign each point to an individual cluster in this technique.
Suppose there are 4 data points. We will assign each of these points to a cluster and
hence will have 4 clusters in the beginning:
[email protected] 44
AGGLOMERATIVE HIERARCHICAL CLUSTERING
Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster
is left:
[email protected] 45
DIVISIVE HIERARCHICAL CLUSTERING
Divisive hierarchical clustering works in the opposite way.
Instead of starting with n clusters (in case of n observations), we start with a single
cluster and assign all the points to that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong
to the same cluster at the beginning:
[email protected] 46
DIVISIVE HIERARCHICAL CLUSTERING
Now, at each iteration, we split off the point farthest from the rest of the cluster and repeat this
process until each cluster contains only a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive
hierarchical clustering.
[email protected] 47
STEPS TO PERFORM HIERARCHICAL CLUSTERING
We merge the most similar points or clusters in hierarchical clustering – we know this. Now the
question is: how do we decide which points or clusters are similar?
Hierarchical clustering is a distance-based algorithm: similarity is measured with a distance metric.
[email protected] 48
STEPS TO PERFORM HIERARCHICAL CLUSTERING
In hierarchical clustering, we have a concept called a proximity matrix. This stores
the distances between each point.
Suppose a teacher wants to divide her students into different groups. She has the
marks scored by each student in an assignment and based on these marks, she wants
to segment them into groups. There’s no fixed target here as to how many groups to
have. Since the teacher does not know what type of students should be assigned to
which group, it cannot be solved as a supervised learning problem. So, we will try to
apply hierarchical clustering here and segment the students into different groups.
Let’s take a sample of 5 students:
[email protected] 49
CREATING A PROXIMITY MATRIX
Let’s make the 5 x 5 proximity matrix for our example:
[email protected] 50
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 1: First, we assign all the points to an individual cluster
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge
the points with the smallest distance. We then update the proximity matrix:
[email protected] 51
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Here, the smallest distance is 3 and hence we will merge points 1 and 2.
[email protected] 52
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Let’s look at the updated clusters and accordingly update the proximity matrix:
[email protected] 53
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Now, we will again calculate the proximity matrix for these clusters:
[email protected] 54
STEPS TO PERFORM HIERARCHICAL CLUSTERING
Step 3: We will repeat step 2 until only a single cluster is left.
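The whole procedure can be sketched with SciPy. The five students' marks below are hypothetical (they are chosen only so that the smallest pairwise distance is 3, consistent with the merge of points 1 and 2 shown above).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical marks for the 5 students (the slide's actual numbers are not shown here)
marks = np.array([[10], [7], [28], [20], [35]], dtype=float)

# Proximity matrix: pairwise Euclidean distances between students
proximity = squareform(pdist(marks))
print(np.round(proximity, 1))

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(marks, method='single')   # 'single' = distance between closest members
print(Z)                              # each row: clusters merged, their distance, new cluster size

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree used below
```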
[email protected] 56
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
Whenever two clusters are merged, we will join them in this dendrogram and the
height of the join will be the distance between these points.
[email protected] 57
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
[email protected] 58
HOW SHOULD WE CHOOSE THE NUMBER OF
CLUSTERS IN HIERARCHICAL CLUSTERING?
The longer a vertical line in the dendrogram, the greater the distance between the clusters
it joins. We can then set a threshold distance and draw a horizontal line across the
dendrogram (generally, we set the threshold so that it cuts the tallest vertical line);
the number of vertical lines the horizontal line crosses gives the number of clusters.
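A minimal sketch of turning such a threshold into flat cluster labels, assuming SciPy and the same hypothetical marks as in the earlier sketch; the threshold value 10 is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same hypothetical marks as in the earlier hierarchical-clustering sketch
marks = np.array([[10], [7], [28], [20], [35]], dtype=float)
Z = linkage(marks, method='single')

# Cut the dendrogram at a chosen distance threshold (here 10, an illustrative value)
labels = fcluster(Z, t=10, criterion='distance')
print(labels)   # one flat cluster label per student
```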
[email protected] 59
THANKS
[email protected] 60
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oSupport Vector Machine
oTypes of SVM
oHyperplane
oSupport Vectors
oLinear SVM Mathematically
oExamples
oPros and Cons
[email protected] 2
INTRODUCTION
•Support Vector Machine abbreviated as SVM
[email protected] 3
INTRODUCTION
•SVM algorithm can be used for Face detection, image
classification, text categorization, etc.
[email protected] 4
WHAT IS SUPPORT VECTOR MACHINE?
•A support vector machine is a machine learning model that is able to
generalize between two different classes if a set of labelled data is
provided to the algorithm as a training set.
[email protected] 5
TYPES OF SVM
1. Linear SVM:
•Linear SVM is used for linearly separable data.
•It means that if a dataset can be classified into two classes by a single
straight line, then such data is termed linearly separable data.
•The classifier used is called the Linear SVM classifier.
[email protected] 6
TYPES OF SVM
2. Non-linear SVM:
•Non-linear SVM is used for non-linearly separable data.
•It means that if a dataset cannot be classified by using a straight line, then such
data is termed non-linear data.
•The classifier used is called the Non-linear SVM classifier.
[email protected] 7
SUPPORT VECTOR MACHINE
[email protected] 8
SUPPORT VECTOR MACHINE
[email protected] 9
SUPPORT VECTOR MACHINE
•To separate the two classes of data points, there are many
possible hyperplanes that could be chosen.
•The objective is to find a plane that has the maximum margin, i.e.
the maximum distance between data points of both classes.
[email protected] 11
HYPERPLANES
[email protected] 12
HYPERPLANES
•Identify the right hyper-plane (Scenario-1):
[email protected] 13
HYPERPLANES
Identify the right hyper-plane (Scenario-2):
[email protected] 14
HYPERPLANES
Identify the right hyper-plane (Scenario-3):
[email protected] 15
HYPERPLANES
Can we classify two classes (Scenario-4)?
[email protected] 16
HYPERPLANES
Find the hyper-plane to segregate the two classes (Scenario-5):
[email protected] 17
HYPERPLANES
Find the hyper-plane to segregate the two classes (Scenario-5): In the scenario below, we
can’t have a linear hyper-plane between the two classes, so how does SVM classify
these two classes?
[email protected] 18
HYPERPLANES
The SVM algorithm has a technique called the kernel trick. An SVM kernel is
a function that takes a low-dimensional input space and transforms it into a higher-
dimensional space, i.e. it converts a non-separable problem into a separable
problem.
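A short illustration of the idea, assuming scikit-learn: on concentric-circle data no straight line separates the classes, but an RBF-kernel SVM (which implicitly works in a higher-dimensional space) separates them easily.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged in concentric circles: not separable by any straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)    # kernel trick: implicit higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))
```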
[email protected] 19
SUPPORT VECTORS
•Support vectors are the data points that lie closest to the hyperplane and influence
its position and orientation.
•Deleting the support vectors will change the position of the hyperplane.
[email protected] 20
SUPPORT VECTORS
[email protected] 21
GEOMETRIC MARGIN
Distance from an example x to the separator: r = y (wᵀx + b) / ‖w‖
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the width of separation between the support vectors of the two classes.
[figure: separating hyperplane with margin ρ, and the distance r from a point x to its projection x′ along w]
For the canonical (functional margin 1) separator, each training example (xᵢ, yᵢ) satisfies:
wᵀxᵢ + b ≥ +1 if yᵢ = +1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
The margin hyperplanes pass through the support vectors: wᵀxₐ + b = +1 and wᵀx_b + b = −1,
on either side of the separating hyperplane wᵀx + b = 0.
This implies: wᵀ(xₐ − x_b) = 2
Taking xₐ and x_b as the closest points on the two margin hyperplanes (so xₐ − x_b is parallel to w):
ρ = ‖xₐ − x_b‖₂ = 2 / ‖w‖₂
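To make ρ = 2/‖w‖ concrete, here is a minimal sketch; the tiny data set and the use of scikit-learn with a large C to approximate a hard margin are assumptions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable set (illustrative values)
X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 5], [5, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Margin between the two supporting hyperplanes: rho = 2 / ||w||
print("w =", w, "b =", b)
print("margin =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```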
SVM EXAMPLE
[email protected] 27
SVM EXAMPLE
[email protected] 28
SVM EXAMPLE
[email protected] 29
SVM EXAMPLE
[email protected] 30
SVM EXAMPLE
[email protected] 31
SVM EXAMPLE
[email protected] 32
SVM EXAMPLE
[email protected] 33
EXAMPLE
[email protected] 34
SVM EXAMPLE
[email protected] 35
SVM EXAMPLE
[email protected] 36
SVM EXAMPLE
[email protected] 37
SVM EXAMPLE
[email protected] 38
PROS AND CONS ASSOCIATED WITH SVM
Pros:
•It works really well with a clear margin of separation
•It is effective in high dimensional spaces.
•It is effective in cases where the number of dimensions is greater than the
number of samples.
•It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
[email protected] 39
PROS AND CONS ASSOCIATED WITH SVM
Cons:
•It doesn’t perform well when we have a large data set, because the required training
time is higher.
•It doesn’t perform very well when the data set has more noise, i.e. when the target classes
overlap.
[email protected] 40
THANKS
[email protected] 41
MACHINE LEARNING (ML-15)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oPrincipal Component Analysis (PCA)
[email protected] 2
MOTIVATION
Clustering
One way to summarize a complex real-valued data point with a
single categorical variable
Dimensionality reduction
Another way to simplify complex high-dimensional data
Summarize data with a lower dimensional real valued vector
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information
NEED FOR PCA
High-dimensional data is extremely complex to process due to inconsistencies in the features, which increase
the computation time.
[email protected] 5
NEED FOR PCA
[email protected] 6
Data Compression
Reduce data from 2D to 1D (e.g. the same length measured in inches and in cm collapses onto a single dimension)
Reduce data from 3D to 2D
Principal Component Analysis (PCA) problem formulation
(slides adapted from Andrew Ng)
STEP BY STEP PCA
[email protected] 11
[email protected] 12
COVARIANCE
Variance and Covariance:
Measure of the “spread” of a set of points around their center of mass (mean)
Variance:
Measure of the deviation from the mean for points in one dimension
Covariance:
Measure of how much each of the dimensions vary from the mean with respect to
each other
Example
EIGENVECTOR AND EIGENVALUE
Ax - λx = 0
Ax = λx
(A – λI)x = 0
[email protected] 25
PRACTICE PROBLEMS BASED ON PCA
Step-01:
Get data.
The given feature vectors are-
x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
[email protected] 26
PRACTICE PROBLEMS BASED ON PCA
Step-02:
Calculate the mean vector (µ): µ = ( (2+3+4+5+6+7)/6 , (1+5+3+6+7+8)/6 ) = (4.5, 5).
[email protected] 27
PRACTICE PROBLEMS BASED ON PCA
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)
Feature vectors (xi) after subtracting mean vector (µ) are-
[email protected] 28
PRACTICE PROBLEMS BASED ON PCA
Step-04:
Calculate the covariance matrix.
Covariance matrix is given by-
[email protected] 29
PRACTICE PROBLEMS BASED ON PCA
[email protected] 30
PRACTICE PROBLEMS BASED ON PCA
Now, taking mi = (xi − µ)(xi − µ)ᵀ for each mean-subtracted feature vector, the covariance matrix
= (m1 + m2 + m3 + m4 + m5 + m6) / 6
≈ [ 2.92  3.67 ; 3.67  5.67 ]
[email protected] 31
PRACTICE PROBLEMS BASED ON PCA
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI| = 0.
So, we have-
[email protected] 32
PRACTICE PROBLEMS BASED ON PCA
From here,
(2.92 – λ)(5.67 – λ) – (3.67 × 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0
Solving this quadratic gives λ1 ≈ 8.22 and λ2 ≈ 0.38; we keep the larger eigenvalue, λ1 = 8.22.
[email protected] 34
PRACTICE PROBLEMS BASED ON PCA
Solving these, we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
From (1) and (2), X1 = 0.69 X2.
From (2), the eigen vector is X = (0.69, 1) up to scale, i.e. approximately (0.57, 0.82) after normalization.
[email protected] 35
PRACTICE PROBLEMS BASED ON PCA
Thus, the principal component for the given data set is the eigen vector corresponding to the largest eigenvalue λ1 = 8.22, i.e. approximately (0.57, 0.82).
[email protected] 36
PRACTICE PROBLEMS BASED ON PCA
Lastly, we project the data points onto the new subspace as-
[email protected] 37
PRACTICE PROBLEMS BASED ON PCA
Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the
previous question.
The given feature vector is (2, 1).
[email protected] 38
PRACTICE PROBLEMS BASED ON PCA
The feature vector gets transformed to
= (eigen vector)ᵀ × (feature vector – mean vector)
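The whole worked example can be checked with a few lines of NumPy. This is a verification sketch, not the slides' own code; note that the covariance here divides by n = 6 to match the slides, and that the sign of an eigenvector is arbitrary.

```python
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                  # mean vector (4.5, 5)
Xc = X - mu                          # mean-subtracted feature vectors

# Covariance matrix as in the slides: average of the outer products (divide by n)
C = (Xc.T @ Xc) / len(X)             # ~ [[2.92, 3.67], [3.67, 5.67]]

eigvals, eigvecs = np.linalg.eigh(C)
pc = eigvecs[:, np.argmax(eigvals)]  # principal component (largest eigenvalue ~ 8.22); sign is arbitrary
print("covariance:\n", np.round(C, 2))
print("eigenvalues:", np.round(eigvals, 2))
print("principal component:", np.round(pc, 2))

# Transform the pattern (2, 1): (eigenvector)^T x (feature vector - mean vector)
print("projection of (2, 1):", pc @ (np.array([2, 1]) - mu))   # ~ +/-4.7 depending on sign
```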
[email protected] 39
SIFT feature visualization
• The top three principal components of SIFT descriptors from a set of images are computed
• Map these principal components to the principal components of the RGB space
• pixels with similar colors share similar structures
Application: Image compression
ORIGINAL IMAGE
[figure: the original 12×12 image patches]
PCA COMPRESSION: 144D → 6D
6 MOST IMPORTANT EIGENVECTORS
[figure: patches reconstructed from 6 components, alongside the 6 eigenvector images]
PCA COMPRESSION: 144D → 3D
3 most important eigenvectors
[figure: patches reconstructed from 3 components, alongside the 3 eigenvector images]
PCA COMPRESSION: 144D → 1D
60 most important eigenvectors
http://en.wikipedia.org/wiki/Discrete_cosine_transform
DIMENSIONALITY REDUCTION
PCA (Principal Component Analysis):
Find the projection that maximizes the variance
Multidimensional Scaling:
Find the projection that best preserves inter-point distances
…
…
THANKS
[email protected] 57
MACHINE LEARNING (ML-17)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oEnsemble Methods
[email protected] 2
INTRODUCTION
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating
an improved model M*
INTRODUCTION
The two most popular ensemble methods are bagging and boosting.
Bagging: train a collection of individual models in parallel, each on a random subset
of the data, and average the predictions over the collection of classifiers.
Boosting: train the classifiers sequentially, re-weighting the training data after each
round; the weight of classifier Mi’s vote (αi) is
αi = log( (1 − error(Mi)) / error(Mi) )
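A minimal sketch contrasting the two, assuming scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` with their default base estimators and a synthetic data set (all illustrative choices, not taken from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: models trained in parallel, each on a random (bootstrap) subset; predictions are averaged
bag = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: models trained sequentially; each classifier's vote is weighted by its error
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bag, X, y, cv=5).mean())
print("boosting:", cross_val_score(boost, X, y, cv=5).mean())
```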
REFERENCES
Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and
techniques. Elsevier, 2011.
[email protected] 14
MACHINE LEARNING (ML-18)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
History of Artificial Neural Networks
What is an Artificial Neural Networks?
How it works?
Learning
[email protected] 2
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
History of the ANNs stems from the 1940s, the decade of the first electronic
computer.
However, the first important step took place in 1957 when Rosenblatt introduced the
first concrete neural model, the perceptron. Rosenblatt also took part in constructing
the first successful neurocomputer, the Mark I Perceptron. After this, the
development of ANNs has proceeded as described in Figure.
[email protected] 3
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Rosenblatt's original perceptron model contained only one layer. From this, a multi-
layered model was derived in 1960. At first, the use of the multi-layer perceptron
(MLP) was complicated by the lack of an appropriate learning algorithm.
In 1974, Werbos came to introduce a so-called backpropagation algorithm for the
three-layered perceptron network.
[email protected] 4
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
The application area of MLP networks remained rather limited until the
breakthrough in 1986, when a general backpropagation algorithm for multi-layered
perceptrons was introduced by Rumelhart and McClelland.
In 1982, Hopfield brought out his idea of a neural network. Unlike the neurons in
MLP, the Hopfield network consists of only one layer whose neurons are fully
connected with each other.
[email protected] 5
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
Since then, new versions of the Hopfield network have been developed. The
Boltzmann machine has been influenced by both the Hopfield network and
the MLP.
[email protected] 6
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1988, Radial Basis Function (RBF) networks were first introduced by
Broomhead & Lowe. Although the basic idea of RBF was developed 30 years
ago under the name method of potential function, the work by Broomhead &
Lowe opened a new frontier in the neural network community.
[email protected] 7
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
In 1982, A totally unique kind of network model is the Self-Organizing Map
(SOM) introduced by Kohonen. SOM is a certain kind of topological map
which organizes itself based on the input patterns that it is trained with. The
SOM originated from the LVQ (Learning Vector Quantization) network the
underlying idea of which was also Kohonen's in 1972.
[email protected] 8
HISTORY OF THE ARTIFICIAL NEURAL NETWORKS
[email protected] 9
ARTIFICIAL NEURAL NETWORK
A set of major aspects of a parallel distributed model
include:
a set of processing units (cells).
a state of activation for every unit, which is equivalent to the output of the
unit.
connections between the units. Generally each connection is defined by a
weight.
a propagation rule, which determines the effective input of a unit from its
external inputs.
[email protected] 10
ARTIFICIAL NEURAL NETWORK
an activation function, which determines the new level of
activation based on the effective input and the current activation.
an external input for each unit.
a method for information gathering (the learning rule).
an environment within which the system must operate, providing
input signals and, if necessary, error signals.
[email protected] 11
COMPUTERS VS. NEURAL NETWORKS
“Standard” Computers Neural Networks
[email protected] 12
WHY ARTIFICIAL NEURAL NETWORKS?
There are two basic reasons why we are interested in building artificial neural
networks (ANNs):
• Technical viewpoint: Some problems such as character recognition or the
prediction of future states of a system require massively parallel and adaptive
processing.
• Biological viewpoint: ANNs can be used to replicate and simulate
components of the human (or animal) brain, thereby giving us insight into
natural information processing.
[email protected] 13
ARTIFICIAL NEURAL NETWORK
•The “building blocks” of neural networks are
the neurons.
• In technical systems, we also refer to them as units or nodes.
[email protected] 15
HOW DO ANNS WORK?
An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to simulate
the information processing capabilities of its biological exemplar.
[email protected] 16
HOW DO OUR BRAINS WORK?
The brain is a massively parallel information processing
system.
Our brains are a huge network of processing elements; a
typical brain contains a network of about 10 billion neurons.
[email protected] 17
HOW DO OUR BRAINS WORK?
A processing element
Dendrites: Input
Cell body: Processor
Synaptic: Link
Axon: Output
[email protected] 18
How do our brains work?
A processing element
[email protected] 19
How do our brains work?
A processing element
[email protected] 20
How do our brains work?
A processing element
The axon endings almost touch the dendrites or cell body of the
next neuron.
[email protected] 22
How do our brains work?
A processing element
[email protected] 23
How do our brains work?
A processing element
Neurotransmitters are chemicals which are released from the first neuron
and which bind to the second.
[email protected] 24
How do our brains work?
A processing element
[email protected] 27
How do ANNs work?
Inputs: x1, x2, …, xm
Processing: the unit sums its inputs, Σ = x1 + x2 + … + xm = y
Output: y
[email protected] 28
How do ANNs work?
Not all inputs are equal: each input xi has an associated weight wi
Inputs: x1, x2, …, xm
Weights: w1, w2, …, wm
Output: y
[email protected] 29
How do ANNs work?
The signal is not passed down to the next neuron verbatim.
Inputs: x1, x2, …, xm with weights w1, w2, …, wm
Processing: weighted sum vk = Σ wi xi
Transfer function (activation function): f(vk)
Output: y = f(vk)
[email protected] 30
The output is a function of the inputs, the weights, and the transfer
(activation) function.
[email protected] 31
Activation Functions:
[email protected] 32
Forward propagation: Vectorized implementation
[email protected] 33
Neural Network learning its own features
[email protected] 34
Other Network Architecture
[email protected] 35
Simple example: AND
[email protected] 36
Simple example: OR
[email protected] 37
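Both AND and OR can be realised by a single neuron with a step activation; the weights below are one common choice (an illustrative assumption, since the slides' figures are not reproduced here).

```python
def neuron(x1, x2, w1, w2, bias):
    # Single neuron with a step (threshold) activation
    return int(w1 * x1 + w2 * x2 + bias >= 0)

# One choice of weights that realises AND and OR (other choices work too)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "AND:", neuron(x1, x2, 1, 1, -1.5),
          "OR:", neuron(x1, x2, 1, 1, -0.5))
```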
Artificial Neural Networks
An ANN can:
1. compute any computable function, by the
appropriate selection of the network
topology and weights values.
[email protected] 38
Learning by trial‐and‐error
Continuous process of:
Trial:
Processing an input to produce an output (In terms of ANN: Compute
the output function of a given input)
Evaluate:
Evaluating this output by comparing the actual output with
the expected output.
Adjust:
Adjust the weights.
[email protected] 39
How it works?
Set initial values of the weights randomly.
Input: truth table of the XOR
Do
Read input (e.g. 0, and 0)
Compute an output (e.g. 0.60543)
Compare it to the expected output. (Diff= 0.60543)
Modify the weights accordingly.
Loop until a condition is met
Condition: certain number of iterations
Condition: error threshold
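A minimal sketch of this trial-evaluate-adjust loop for the XOR truth table, using a small 2-4-1 sigmoid network; the network size, learning rate and the two stopping conditions' values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Input: truth table of XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Set initial values of the weights randomly (2 inputs, 4 hidden units, 1 output)
W1, b1 = rng.uniform(-1, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.uniform(-1, 1, (4, 1)), np.zeros((1, 1))

lr = 1.0
for _ in range(20000):                       # condition: number of iterations
    H = sigmoid(X @ W1 + b1)                 # compute an output for each input
    Y = sigmoid(H @ W2 + b2)
    err = Y - T                              # compare with the expected output
    if np.mean(err ** 2) < 1e-3:             # condition: error threshold
        break
    # Modify the weights accordingly (gradient of the squared error)
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * (H.T @ dY); b2 -= lr * dY.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0, keepdims=True)

print(np.round(Y, 3))   # should approach 0, 1, 1, 0 once training has converged
```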
[email protected] 40
Design Issues
Initial weights (small random values ∈ [-1, 1])
Transfer function (How the inputs and the weights are
combined to produce output?)
Error estimation
Weights adjusting
Number of neurons
Data representation
Size of training set
[email protected] 41
Transfer Functions
Linear: The output is proportional to the total weighted
input.
[email protected] 42
Error Estimation
The root mean square error (RMSE) is a frequently used
measure of the differences between the values predicted by a
model or an estimator and the values actually observed from
the thing being modeled or estimated.
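For reference, a one-function sketch of the RMSE (the sample numbers are made up):

```python
import numpy as np

def rmse(predicted, observed):
    # Root mean square error between model outputs and target values
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    return np.sqrt(np.mean((predicted - observed) ** 2))

print(rmse([0.9, 0.2, 0.8], [1, 0, 1]))   # ~0.173
```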
[email protected] 43
Weights Adjusting
After each iteration, weights should be adjusted to
minimize the error.
– All possible weights
– Back propagation
[email protected] 44
Back Propagation
Back-propagation is an example of supervised learning; it is used at
each layer to minimize the error between the layer’s response and
the actual data.
The error at each hidden layer is an average of the evaluated error.
Hidden-layer networks are trained this way.
[email protected] 45
ANN EXAMPLE
[email protected] 46
BACKPROPAGATION STEP BY STEP
[email protected] 47
BACKPROPAGATION STEP BY STEP
• Neural network training is about finding weights that minimize prediction error. We
usually start our training with a set of randomly generated weights.
• Then, backpropagation is used to update the weights in an attempt to correctly map
arbitrary inputs to outputs.
[email protected] 48
BACKPROPAGATION STEP BY STEP
The initial weights are as follows: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14 and w6 = 0.15
[email protected] 49
BACKPROPAGATION STEP BY STEP
Dataset
Our dataset has one sample with two inputs and one output.
[email protected] 50
BACKPROPAGATION STEP BY STEP
Forward Pass
We will use the given weights and inputs to predict the output. Inputs are multiplied by weights;
the results are then passed forward to the next layer.
[email protected] 51
BACKPROPAGATION STEP BY STEP
Calculating Error
Now, it’s time to find out how our network performed by calculating the difference between the
actual output and the predicted one. It’s clear that our network output, or prediction, is not even
close to the actual output. We can calculate the difference, or the error, as follows.
[email protected] 52
BACKPROPAGATION STEP BY STEP
Reducing Error
• The main goal of training is to reduce the error, i.e. the difference
between the prediction and the actual output. Since the actual output is constant,
“not changing”, the only way to reduce the error is to
change the prediction value.
• How do we change the prediction value? By adjusting the weights.
• Backpropagation calculates the gradient of the error function with respect to the neural
network’s weights. The calculation proceeds backwards through the
network.
[email protected] 57
BACKPROPAGATION STEP BY STEP
So to update w6 we can apply the following formula
Similarly, we can derive the update formula for w5 and any other weights existing between the output and
the hidden layer.
[email protected] 58
BACKPROPAGATION STEP BY STEP
However, when moving backward to update w1, w2, w3 and w4, which lie between the input and hidden
layer, the partial derivative of the error function with respect to w1, for example, will be as follows.
[email protected] 59
BACKPROPAGATION STEP BY STEP
We can find the update formulas for the remaining weights w2, w3 and w4 in the same way.
In summary, the update formulas for all weights are as follows:
[email protected] 60
BACKPROPAGATION STEP BY STEP
We can rewrite the update formulas in matrix form as follows:
[email protected] 61
BACKPROPAGATION STEP BY STEP
Backward Pass
Using derived formulas we can find the new weights.
Learning rate: a hyperparameter, which means we have to choose (tune) its value manually.
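A small numeric sketch of the forward and backward passes using the initial weights quoted earlier. The sample (i1 = 2, i2 = 3, target = 1), the learning rate, and the exact 2-2-1 wiring are assumptions for illustration, not taken from the slides.

```python
# The initial weights quoted in the slides
w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15

# Assumed sample and learning rate (the slide's actual numbers are not reproduced here)
i1, i2, target = 2.0, 3.0, 1.0
lr = 0.05

for step in range(5):
    # Forward pass through a 2-2-1 network with linear units (one plausible wiring)
    h1 = i1 * w1 + i2 * w2
    h2 = i1 * w3 + i2 * w4
    out = h1 * w5 + h2 * w6

    error = 0.5 * (out - target) ** 2          # squared-error loss
    delta = out - target                       # d(error)/d(out)

    # Backward pass: gradient of the error w.r.t. each weight (chain rule), then update
    g5, g6 = delta * h1, delta * h2
    g1, g2 = delta * w5 * i1, delta * w5 * i2
    g3, g4 = delta * w6 * i1, delta * w6 * i2
    w1, w2, w3, w4 = w1 - lr * g1, w2 - lr * g2, w3 - lr * g3, w4 - lr * g4
    w5, w6 = w5 - lr * g5, w6 - lr * g6

    print(f"step {step}: prediction={out:.4f} error={error:.4f}")
```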
[email protected] 62
BACKPROPAGATION STEP BY STEP
Now, using the new weights, we repeat the forward pass.
[email protected] 63
BACKPROPAGATION STEP BY STEP
https://hmkcode.com/netflow/
[email protected] 64
Applications Areas
Function approximation
including time series prediction and modeling.
Classification
including pattern and sequence recognition, novelty
detection and sequential decision making.
(radar systems, face identification, handwritten text recognition)
Data processing
including filtering, clustering, blind source separation and
compression.
(data mining, e-mail spam filtering)
[email protected] 65
Advantages / Disadvantages
Advantages
Adapt to unknown situations
Powerful, it can model complex functions.
Ease of use, learns by example, and very little user
domain‐specific expertise needed
Disadvantages
Forgets
Not exact
Large complexity of the network structure
[email protected] 66
Conclusion
Artificial Neural Networks are an imitation of biological
neural networks, but much simpler ones.
Computing has a lot to gain from neural networks: their
ability to learn by example makes them very flexible and
powerful, and there is no need to devise an algorithm in
order to perform a specific task.
[email protected] 67
Conclusion
Neural networks also contribute to areas of research such as
neurology and psychology. They are regularly used to model
parts of living organisms and to investigate the internal
mechanisms of the brain.
Many factors affect the performance of ANNs, such as the
transfer functions, size of training sample, network topology,
weights adjusting algorithm, …
[email protected] 68
Q. How does each neuron work in ANNs?
What is back propagation?
A neuron: receives input from many other neurons;
changes its internal state (activation) based on the
current input;
sends one output signal to many other neurons, possibly
including its input neurons (when the ANN is a recurrent network).
[email protected] 69
THANKS
[email protected] 70
MACHINE LEARNING (ML-19)
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oArtificial Neural Network
Gradient Descent
Stochastic Gradient Descent
Gradient Descent Vs Stochastic Gradient Descent
[email protected] 2
OPTIMIZATION
Optimization is always the ultimate goal whether you are dealing with a real life
problem or building a software product.
Optimization basically means getting the optimal output for your problem.
[email protected] 5
GRADIENT DESCENT ALGORITHM AND ITS VARIANTS
Gradient Descent is an optimization algorithm used for minimizing the cost
function in various machine learning algorithms. It is basically used for
updating the parameters of the learning model.
[email protected] 7
BATCH GRADIENT DESCENT
[email protected] 8
STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent:
• This is a type of gradient descent which processes 1 training example per
iteration.
• Hence, the parameters are being updated even after one iteration in which
only a single example has been processed.
• Hence this is considerably faster than batch gradient descent per update.
• But when the number of training examples is large, processing only one
example per iteration becomes an overhead for the system, since the number
of iterations will be quite large.
[email protected] 9
STOCHASTIC GRADIENT DESCENT
[email protected] 10
MINI BATCH GRADIENT DESCENT:
Mini Batch gradient descent:
• This is a type of gradient descent which works faster than both batch
gradient descent and stochastic gradient descent.
• Here b examples where b<m are processed per iteration.
• So even if the number of training examples is large, it is processed in
batches of b training examples in one go.
• Thus, it works for larger training sets, and with a smaller number
of iterations.
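A compact sketch of the three variants on a toy linear-regression problem: setting b = m gives batch gradient descent, b = 1 gives stochastic gradient descent, and 1 < b < m (here b = 32) gives mini-batch gradient descent. The data and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = 3x + 2 + noise
X = rng.uniform(-1, 1, (1000, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.standard_normal(1000)
Xb = np.hstack([X, np.ones((len(X), 1))])      # add a bias column

def gradient_step(theta, Xb, y, lr):
    # Gradient of the mean squared error w.r.t. the parameters
    grad = 2 * Xb.T @ (Xb @ theta - y) / len(y)
    return theta - lr * grad

theta = np.zeros(2)
b, lr = 32, 0.1                                 # b examples per iteration (b < m)
for epoch in range(50):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), b):           # batch GD: b = m; SGD: b = 1
        batch = idx[start:start + b]
        theta = gradient_step(theta, Xb[batch], y[batch], lr)

print("estimated [slope, intercept]:", np.round(theta, 2))   # ~ [3, 2]
```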
[email protected] 11
MINI BATCH GRADIENT DESCENT:
[email protected] 12
THANKS
[email protected] 13
MACHINE LEARNING (ML-20)
AN INTRODUCTION TO : DEEP LEARNING
Dr. NEERAJ GUPTA, Department of CEA, GLA University, Mathura
AGENDA
oAn introduction to: Deep Learning
aka or related to
Deep Neural Networks
Deep Structural Learning
Deep Belief Networks
etc,
[email protected] 2
Machine Learning Basics
Machine learning is a field of computer science that gives computers the ability to
learn without being explicitly programmed
Machine Learning
[figure: supervised learning pipeline, labeled data fed to a learning algorithm during training, then predictions (e.g. "class A") on new data; related task types include anomaly detection and sequence labeling]
http://mbjoseph.github.io/2013/11/27/measure.html
ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations
and input features
ML becomes just optimizing weights to best make a final prediction
What is Deep Learning (DL) ?
A machine learning subfield concerned with learning representations of data. Exceptionally effective
at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a
hierarchy of multiple layers
If you provide the system tons of information, it begins to understand it and respond in
useful ways.
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
o Manually designed features are often over-specified, incomplete and take a long
time to design and validate
o Learned Features are easy to adapt, fast to learn
o Deep learning provides a very flexible, (almost?) universal, learnable framework for
representing world, visual and linguistic information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
[slide equations: layer computations of the form h = g(Wx + b)]
Activation functions
How do we train?
Demo
Training
1. Sample labeled data (a batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights (scaled by the learning rate)
Activation functions
Non-linearities needed to learn complex (non-linear) representations of data, otherwise
the NN would be just a linear function
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
http://adilmoujahid.com/images/activation.png
- Sigmoid neurons saturate and kill gradients, so the NN will barely learn
• when a neuron’s activation is 0 or 1, it saturates
• the gradient in these regions is almost zero
• almost no signal will flow to its weights
• if the initial weights are too large, most neurons will saturate
Activation: Tanh
Takes a real-valued number and “squashes” it into the range between −1 and 1.
tanh: ℝ → [−1, 1]
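To see the saturation effect numerically, here is a tiny sketch of the sigmoid, its gradient, and tanh (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)        # near zero when the unit saturates (activation ~0 or ~1)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid :", np.round(sigmoid(z), 4))
print("gradient:", np.round(d_sigmoid(z), 4))   # tiny at |z| = 10: gradients are "killed"
print("tanh    :", np.round(np.tanh(z), 4))     # squashes into (-1, 1)
```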
http://adilmoujahid.com/images/activation.png
http://adilmoujahid.com/images/activation.png
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
Dropout
• Randomly drop units (along with their connections)
during training
• Each unit retained with fixed probability p,
independent of other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural
networks from overfitting." Journal of machine learning research (2014)
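A minimal sketch of a dropout mask in the "inverted dropout" style; scaling the surviving units by 1/p at training time is a common implementation choice, not something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    # Retain each unit with probability p; zero the rest and scale survivors by 1/p
    # ("inverted dropout") so no rescaling is needed at test time.
    if not training:
        return activations
    mask = rng.random(activations.shape) < p
    return activations * mask / p

h = np.ones((2, 8))
print(dropout(h, p=0.5))
```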
L2 = weight decay
• Regularization term that penalizes big weights, added to the
objective: J_reg = J + λ Σ wᵢ²
• The weight decay value λ determines how dominant regularization is during
gradient computation
• A big weight decay coefficient gives a big penalty for big weights
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
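A small sketch of the patience rule; the per-epoch validation errors are made-up numbers.

```python
# Made-up validation errors, one value per epoch
validation_errors = [0.90, 0.80, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.71]

best_val, best_epoch, patience, wait = float("inf"), 0, 5, 0
for epoch, val_error in enumerate(validation_errors):
    if val_error < best_val:
        best_val, best_epoch, wait = val_error, epoch, 0   # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:            # no improvement for n = patience epochs
            print(f"stop at epoch {epoch}; best model was at epoch {best_epoch}")
            break
```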
Loss functions and output
Classification and regression use different losses and output units: classification typically
uses a softmax output with a cross-entropy loss, while regression uses a linear (identity) output, f(x) = x, with a squared-error loss.
Convolution
A small filter (here 3×3) slides over the input matrix; at each position the element-wise products are summed to produce one output value.
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Convolutional Neural Networks (CNNs)
Main CNN idea for text:
Compute vectors for n-grams and group them afterwards
Max pooling: 2×2 filters with stride 2
https://shafeentejani.github.io/assets/images/pooling.gif
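A minimal NumPy sketch of 2×2 max pooling with stride 2 (the input values are made up):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # 2x2 max pooling with stride 2: keep the largest value in each 2x2 block
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]])
print(max_pool_2x2(fm))   # [[6 4] [8 9]]
```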
CNN for text classification
Severyn, Aliaksei, and Alessandro Moschitti. "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment
Classification." SemEval@ NAACL-HLT. 2015.
CNN with multiple filters
https://pbs.twimg.com/media/C2j-8j5UsAACgEK.jpg
[slide equations: two hidden-state updates of the form h = g(Wh + …), whose outputs are concatenated: y = [h ; h]]