Machine Learning Concepts
Machine Learning has several tribes:
- Symbolists discover new knowledge by filling in the missing information, i.e. they predict the categories (logical or numerical) through math programs (Inverse Deduction) - Tom Mitchell [the Biologist Robot]
- Evolutionaries start with basic knowledge, formulate hypotheses using Inverse Deduction, design models and run them. They simulate evolution and perform 'structure discovery' through Genetic Programming: robots live in a simulated world and evolve (in each generation the fittest robot gets a chance to 3D-print the next robot).
- Connectionists (neuroscientists) emulate the brain using Backpropagation - Geoff Hinton.
It best solves the Credit Assignment problem (correcting the credits) using Backpropagation.
The 'Google Cat Network' learnt cats from YouTube videos.
- Bayesians systematically reduce uncertainty using Probabilistic Inference.
They are best at predicting the chance of something happening given more evidence: say, with an email as the evidence, find the probability of 2 hypotheses (H1 - spam, H2 - not spam).
- Analogizers detect similarities between past and present through reasoning by analogy, using Kernel Machines.
They learn from similarity (and need much less data, as they can generalize from it).
Well, that was an interesting perspective on Machine Learning.
Let's now focus on the most common goal of an ML algorithm.
ML algorithms usually solve an optimization problem: we need to find parameters for a given model that minimize
- the loss function (prediction error), plus
- a regularization term that rewards model simplicity.
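For instance, a ridge-style objective adds the two terms together. A minimal sketch, assuming an L2 penalty and NumPy; the data and variable names are illustrative:

```python
import numpy as np

def regularized_loss(theta, X, y, lam=0.1):
    """Squared prediction error plus an L2 penalty that favors simpler models."""
    predictions = X @ theta
    loss = 0.5 * np.sum((predictions - y) ** 2)   # prediction error
    penalty = lam * np.sum(theta ** 2)            # regularization (model simplicity)
    return loss + penalty

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # toy design matrix (bias + one feature)
y = np.array([3.0, 5.0, 7.0])
print(regularized_loss(np.array([1.0, 1.5]), X, y))
```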
A Concept is a function or mapping from objects to membership: a mapping between objects in the world and membership in a set.
An Instance is a vector of attribute-value pairs (the input space of a Concept, e.g. the pixels of a picture, or the credit scores for an income).
A Target Concept is the actual answer that is being searched for in the space of multiple concepts.
A Hypothesis helps to predict the target concept (the actual answer):
- we apply candidate concepts to the testing set (which should include lots of examples)
- we apply inductive learning to choose a hypothesis from a given set of examples
We need to ask some relevant questions to choose a Hypothesis!
What's the Inductive Bias for the classification function?
>> Inductive Bias helps us find a general rule from examples.
>> Generalization is the whole point of Machine Learning.
What's Occam's Razor?
>> Prefer the simplest hypothesis that fits the data.
What's the Restriction Bias?
>> Consider only those hypotheses which can be represented by the chosen algorithm.
Supervised classification => Function Approximation : predicting outcome when we know the different classifications
example: predicting the type of flower (setosa, versicolor, or virginica) based on sepal width/length
Unsupervised classification => Category Clustering : predicting outcome when we don’t know what are the different
classifications.
example: splitting all data for sepal width/length into different groups (cluster similar data together)
Reinforcement Learning => Learning from Delayed Reward.
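A small sketch contrasting the supervised and unsupervised settings on the iris sepal features mentioned above, assuming scikit-learn is available; the classifier choice and cluster count are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data[:, :2]      # sepal length, sepal width
y = iris.target           # known classes: setosa, versicolor, virginica

# Supervised: we know the classifications and approximate the mapping X -> y
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[5.0, 3.4]]))

# Unsupervised: we ignore y and simply group similar sepal measurements
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(clusters[:10])
```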
Eager & Lazy Learners:
Eager Learners: decision trees, regression, neural networks, SVMs, Bayes nets.
They find a function that best fits the training data, i.e. they spend time up front to learn from the data; when new inputs are received, the input features are fed into that function. Here we consider global-scale inputs and avoid local sensitivities.
Lazy Learners: lazy learners do not compute a function to fit the training data before new data is received, so we save significant time upfront; new instances are compared to the training data to make a classification / regression decision. This considers local-scale estimation.
The notes below compare each algorithm along the same dimensions: Algo | Preference Bias | Learning Function | Performance | Enhancements | Usage.
Algo
Bayesian (Eager Learner) - Classification
Preference Bias
Prior domain knowledge:
~ Pr(h): prior probability for each candidate hypothesis h
~ Pr(D): probability distribution over the observed data for each h
Occam's Razor: select the h with minimum length.
** There is at least one maximally probable hypothesis:
hMAP = argmax P(h|D) -> argmax P(D|h) (for a uniform prior)
Learning Function
Posterior probability: P(h|D) = P(D|h) . P(h) / P(D)
Key assumption: every hi is equally probable a priori => P(hi) = P(hj)
* Noise-free data, hypotheses uniformly distributed over the version space VS:
P(h) = 1 / |H| ; P(D|h) = { 1 if di = h(xi), 0 otherwise } ; P(h|D) = 1 / |VS|
* Noisy data: di = f(xi) + e
hML = argmax P(h|D) = argmax P(D|h) = argmax Π P(di|h)
Taking the log, maximizing the likelihood is equivalent to argmin Σ (di - h(xi))²
* vMAP = argmax_v Σ_h P(v|h) . P(h|D)
Performance
Cons:
* significant computational cost to find the Bayes optimal hypothesis
* sometimes a huge number of hypotheses needs to be surveyed
* Naive Bayes handles missing data very well: it just excludes the attribute with missing data when computing the posterior probability (i.e. the probability of a class given a data point)
Enhancements
Pros: no need to be aware of the given hypothesis.
- For a smaller training set, Naive Bayes is a good bet!
* Use Bayesian Learning to represent Conditional Independence of variables.
* Assumes real-valued attributes are normally distributed; as a result, NB can only have linear, elliptic, or parabolic decision boundaries.
* Example issues: misclassification, pruning, fitting errors.
Usage
* Spam filtering: the class node (spam) is the parent of the evidence words Lottery, Bank, College.
P(spam | lottery, not bank, not college) ∝ P(vi) . Πi P(ai | v)
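A hedged sketch of the spam usage example with a Bernoulli Naive Bayes model, assuming scikit-learn; the tiny word-presence dataset is invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# columns: [lottery, bank, college]; rows: emails
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 1])   # 1 = spam, 0 = not spam

model = BernoulliNB().fit(X, y)
# P(spam | lottery, not bank, not college)
print(model.predict_proba([[1, 0, 0]]))
```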
Algo
Decision Tree (Eager Learner)
ID3, C4.5
Approximates discrete-valued functions as a disjunction of conjunctions of constraints on attribute values.
Description
Classification
: for discrete input data
: for continuous input data (consider a range selection as the condition, e.g. > 20%)
Preference Bias
Occam's Razor: prefer the shorter tree.
Other biases:
: prefer attributes with many possible values
: prefer trees that place high-information-gain attributes close to the root (the attribute with the best answers, NOT the best splits)
Learning Function
InfoGain(S, A) = Entropy(S) - Σv (|Sv| / |S|) * Entropy(Sv)
** the weighted sum of the entropies of the partitions
* Entropy(S) = -Σ Pv log(Pv)
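A minimal sketch of the two quantities above (entropy and information gain) in plain Python; the helper names and toy labels are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(labels, attribute_values):
    """Entropy(S) minus the weighted sum of the entropies of the partitions."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    weighted = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

labels = ['yes', 'yes', 'no', 'no', 'yes']
outlook = ['sunny', 'rain', 'sunny', 'rain', 'rain']
print(info_gain(labels, outlook))
```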
Performance
The usual problem for a decision tree: for N boolean attributes there are 2^N combinations of rows and 2^(2^N) possible output functions.
** So instead of iterating over all rows, first work only on the attributes which have the highest information gain.
** Handles noise, handles missing values.
=============
Scope of improvement:
Decision trees, however, often achieve lower generalization accuracy compared to other learning methods, such as support vector machines and neural networks. One common way to improve their accuracy is boosting.
Enhancement
Pros: computes the best attribute in one move.
Cons:
* does not look ahead or behind (this problem is addressed by hill-climbing search)
* tends to overfit as it looks into many different combinations of features
* logistic regression avoids overfitting more elegantly
** Overfitting solutions for decision trees:
>> stop growing the tree before it grows too large
>> prune after a certain threshold
* consider the interdependency between attributes, P(Y=y | X=x)
* consider GainRatio and SplitInfo
Usage
- restaurant selection decision based on cost, menu, appetite, weather, and other features.
Algo
Decision Tree: Regression
Description
Classification: for continuous output data
Lazy, distance-based learning function:
For each training sample sl -> Sl:
Dl = dist(sl, Sl) = root-sum-square of the differences
Wj = dmax - dj
Advantages of decision trees include:
● computational scalability
● handling of messy data: missing values, various feature types
● ability to deal with irrelevant features: the algorithm selects "relevant" features first, and generally ignores irrelevant features
● if the decision tree is short, it is easy for a human to interpret it: decision trees do not produce a black-box model.
Algo
Linear Regression (Eager Learner)
Model a linear relationship between a dependent variable (y) and independent variables (x1, x2, ...).
Regression, as a term, stems from the observation that individual instances of any observed attribute tend to regress towards the mean.
Description
Classification:
Scalar input, continuous output
Vector input, continuous output
** Vector input -> combinations of multiple features into a single feature
Preference Bias
Regress to the mean.
Gradient:
* for one variable, the derivative is the slope of the tangent line
* for several variables, the gradient is the direction of the fastest increase of the function
Learning Function
ŷ = θ0 + θ1x1 + θ2x2
yi = observed value
Minimize the sum of squared errors: J(θ) = ½ Σ (ŷ - yi)²
Gradient descent update: θnext = θcurrent - α ∇J(θ)
α is the learning rate, chosen so that the function takes a small step in the direction opposite to ∇J (the direction of fastest increase).
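A short sketch of this gradient-descent update for the least-squares objective, assuming NumPy; the learning rate, iteration count, and toy data are illustrative:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is the bias term
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)
alpha = 0.05                                         # learning rate

for _ in range(2000):
    grad = X.T @ (X @ theta - y)                     # gradient of J = 1/2 * sum((y_hat - y)^2)
    theta = theta - alpha * grad                     # small step opposite to the gradient
print(theta)                                         # approaches [0, 2]
```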
Performance
Cons: the function should be differentiable.
Caution: the learning rate must not be very small or very large.
Enhancement
Polynomial Regression
Usage
Housing price prediction
Algo
Multi-Layer Perceptron (Eager Learner)
Description
Classification
Preference Bias
Initial weights should be chosen to be small and random values:
- small random weights help avoid local minima
- they provide variability and low complexity (larger weights equate to larger complexity).
Learning Function
A perceptron is a linear function that offers a hyperplane in n dimensions, perpendicular to the weight vector w = (w1, w2, ..., wn). The perceptron classifies things on one side of the hyperplane as positive and things on the other side as negative.
Perceptron rule:
Guarantees finite convergence, however, only if the data is linearly separable.
Δwi = η(y - ŷ)xi
Gradient descent rule:
Calculus-based. More robust to data sets that are not linearly separable; however, it converges only to local minima / optima.
Δwi = η(y - a)xi
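A minimal sketch of the perceptron rule Δwi = η(y - ŷ)xi on a linearly separable toy set (AND-like labels); the data and learning rate are illustrative:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])            # target labels
w = np.zeros(2)
b = 0.0
eta = 0.1                             # learning rate

for _ in range(20):                   # a few passes over the data
    for xi, target in zip(X, y):
        y_hat = 1 if xi @ w + b > 0 else 0   # threshold (step) unit
        w += eta * (target - y_hat) * xi     # perceptron update
        b += eta * (target - y_hat)
print(w, b)
```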
Performance
Neural networks
have low
restriction bias,
because they can
model many
different
functions.
Therefore they
have the danger
of overfitting.
Neural Networks
consist of:
Perceptron:
half-spaces
Sigmoids
(instead of step
functions): much
more complex
Hidden Layers
(groups of
sigmoid
functions)
So it allows for
modeling many
types of
functions /
behaviors, such
as:
Boolean:
network of
threshold-like
units
Continuous:
through hidden
layers (e.g. use
of sigmoids
instead of step)
Arbitrary (non-
continuous):
multiple hidden
layers
Enhancement
The addition of hidden layers helps map continuous functions (a change in input changes the output very smoothly).
Multiply weights only if we don't get better errors!
Usage
One obvious advantage of artificial neural networks is the ability to produce any number of outputs (multi-class), while support vector machines have only one. The most direct way to create an n-ary classifier with support vector machines is to create n support vector machines and train each of them one by one. On the other hand, an n-ary classifier with neural networks can be trained in one go.
===========
A multi-layer perceptron is able to find relations between features. For example, this is necessary in computer vision when a raw image is provided to the learning algorithm and sophisticated features must be calculated. Essentially, the intermediate levels can calculate new, unknown features.
Algo
K Nearest Neighbors - Classification
Remembers the mapping, fast lookup.
Preference Bias:
Why consider KNN over others?
* near points are similar to one another (locality)
* smoothly changing behavior from one neighborhood to another neighborhood
* so we can choose the best distance function
Learning Function
Choose the best distance function.
Manhattan (ℓ1): d = |y2 - y1| + |x2 - x1|
Euclidean (ℓ2): d = sqrt((y2 - y1)² + (x2 - x1)²)
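A sketch of k-NN classification using the Euclidean distance above; the toy points are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(dists)[:k]                           # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]     # majority vote

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(['a', 'a', 'b', 'b'])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))   # -> 'a'
```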
Performance:
Problem: the curse of dimensionality: as the number of features grows, the amount of data required for accurate generalization grows exponentially, O(2^d).
Reducing the number of features (or down-weighting less relevant ones) will help curb the effect of dimensionality.
When k is small, models have high variance (low bias), fitting on a strongly local level. Larger k creates models with lower variance but higher bias.
Cons:
* KNN doesn't know which attributes are more important
* doesn't handle missing data gracefully
Enhancements:
Generalization - NO; overfitting - YES (KNN memorizes the training data rather than building a general model, so it can overfit).
Usage
No assumption about the data distribution (a great advantage over Naive Bayes).
It is highly non-parametric.
Algo
K Nearest Neighbors - Regression:
LWR (locally weighted regression)
Learning Function
It combines traditional regression with instance-based learning's sensitivity to training items with high similarity to the test point.
Performance:
-- reduce the pull effect of far-away points through kernels
-- the squared deviations are weighted by a kernel function that decreases with distance, such that for a new test instance a regression function is found for that specific point, one that emphasizes fitting close-by points and ignores the pull of faraway points.
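A sketch of locally weighted regression with a Gaussian kernel, so far-away points get little pull; the bandwidth tau and toy data are assumptions for illustration:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    # kernel weights decrease with distance from the query point
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y   # weighted least squares
    return np.array([1.0, x_query]) @ theta

x = np.linspace(0, 10, 30)
X = np.column_stack([np.ones_like(x), x])               # bias column + feature
y = np.sin(x) + 0.1 * np.random.randn(30)
print(lwr_predict(X, y, 3.0))
```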
Preference Bias (Boosting):
- An individual rule (the result of learning over a subset of data) does not provide the answer, but when combined, the complex rule works well.
- Choose those examples where it offers better performance on testing subsets of data than fitting a 4th-order polynomial.
Learning Function
Error: PrD[h(x) ≠ c(x)]
** boost up the distribution over examples:
     h1   h2   h3
x1   +1   -1   +1
x2   -1   -1   +1
x3   +1   -1   +1
** Find a hypothesis ht at each time step with small error (a Weak Classifier), constantly creating new distributions (Boosting).
** Final Hypothesis: the sgn (sign) function of the weighted sum of all of the rules.
Performance:
Why does Boosting do so well?
>> If there are some samples which do not provide a good result, then boosting can re-weight the samples so that some of the 'past under-performers' become more important.
>> Use Gradient Boosting to handle noisy data in decision trees: https://en.wikipedia.org/wiki/Gradient_boosting
>> Boosting does overfit if the Weak Learners use NNs with many layers of nodes.
Choosing Subsets:
Instead of selecting subsets randomly, we can pick subsets containing the hardest examples, i.e. those examples that don't perform well given the current rule.
Combine:
Instead of a mean, consider a weighted mean.
Enhancements:
● Computationally efficient.
● No difficult parameters to set.
● Versatile: a wide range of base learners can be used with AdaBoost.
Caveats:
● The algorithm seems susceptible to uniform noise.
● The weak learner should not be too complex, to avoid overfitting.
● There needs to be enough data so that the weak learning requirement is satisfied: the base learner should perform consistently better than random guessing, with generalization error < 0.5 for binary classification problems.
Usage
Spam rules:
body: contains word 'manly' → YES
from: your spouse → NO
body: short length → YES
body: only contains urls → YES
body: just an image → YES
body: contains words belonging to a blacklist (misspellings) → YES
All of these rules are useful; however, no specific one can determine spam (or not) on its own. We need to find a way to combine them.
Another use: find which Wiki pages can be recommended for an extended period of time (the feature set is a combination of binary, text, and numeric features).
Ref: http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://media.nips.cc/Conferences/2007/Tutorials/Slides/schapire-NIPS-07-tutorial.pdf
************
If you have a dense feature set, go with boosting.
Algo
Ensemble Learning
** Solves the Classification Problem.
*************
Boosting is a meta-learning technique, i.e. something you put on top of a set of learners to form an ensemble.

Notes on Ensemble Learning (Boosting)
Important difference of Ensemble Learners from other types of Learners:
-- A NN already knows the network structure and tries to learn the weights.
-- A decision tree gradually builds the rules.
But an Ensemble Learner finds the best combination of rules.
Algo
SVM (Eager Learner)
Preference Bias:
Support: the goal with the support vector machine is to maximize the margin, m, subject to the constraint that we classify everything correctly. Together, this can be defined mathematically as:
max(m) : yi(wT xi + b) ≥ 1 ∀i
Learning Function
Find the line of least commitment in the linearly separable set of data; that is the basis behind support vector machines: a line that leaves as much space as possible from the boundaries.
y = wT xj + b
where: y is the classification label, y ∈ {-1, +1} (in class for wT x + b > 0, out of class for wT x + b < 0); wT and b are the parameters of the plane.
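A hedged sketch of this maximum-margin classifier using scikit-learn's SVC with a linear kernel; the toy points and the large C (approximating a hard margin) are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e3).fit(X, y)   # large C ~ hard margin
print(clf.coef_, clf.intercept_)              # parameters w and b of the separating plane
print(clf.support_vectors_)                   # only the boundary points are remembered
```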
Performance:
>> Similar to KNN, but here, instead of being completely lazy, we spend upfront effort solving a complicated quadratic program so that only the required points (the support vectors) are kept.
>> For classification tasks involving more than two groups, a common strategy is to use multiple binary classifiers to decide on a single best class for new instances.
Enhancements:
y = w phi(x) + b
- Use a Kernel when the feature vector phi is of higher dimension.
Many machine learning algorithms can be written to only use dot products, and then we can replace the dot products with kernels.
Usage
Mostly binary classification (linear and non-linear).
1) If you have a sparse feature set, go with a linear SVM (or another linear model).
2) If you don't care about speed and memory, try a kernel SVM.
*************
In order to eliminate expensive parameter tuning and better handle a high-dimensional input space, we can use a kernelized SVM for text classification (tens of thousands of support vectors, each having hundreds of thousands of features).
Description
Classification: the classifier output is greater than or equal to +1 for the positive examples and less than or equal to -1 for the negative examples; the margin follows from the difference between the vectors x1 and x2 (the closest points on either side) projected onto w.
The AdaBoost procedure (classification):
1. Initialize the importance weights wi = 1/N for all training examples i.
2. For m = 1 to M:
   a) Fit a classifier Gm(x) to the training data using the weights wi.
   b) Compute the error: errm = Σi wi I(yi ≠ Gm(xi)) / Σi wi
   c) Compute αm = log((1 - errm) / errm)
   d) Update weights: wi ← wi · exp[αm · I(yi ≠ Gm(xi))] for i = 1, 2, ..., N
3. Return G(x) = sign[ Σm αm Gm(x) ].
We can see that for error < 0.5, the αm parameter is positive.
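A direct sketch of these AdaBoost steps, assuming scikit-learn decision stumps as the base classifier Gm; M and the toy data are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([1, 1, -1, -1, 1, 1])
N, M = len(y), 10
w = np.ones(N) / N                                   # 1. initialize weights w_i = 1/N
stumps, alphas = [], []

for m in range(M):                                   # 2. for m = 1..M
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # a) fit G_m
    miss = (stump.predict(X) != y).astype(float)
    err = (w * miss).sum() / w.sum()                 # b) weighted error
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = np.log((1 - err) / err)                  # c) alpha_m
    w = w * np.exp(alpha * miss)                     # d) up-weight misclassified points
    stumps.append(stump)
    alphas.append(alpha)

G = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))  # 3. weighted vote
print(G)
```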
Notes on Support Vector Machines - SVM
>>>
Here, instead of Polynomial Regression, we consider a Polynomial Kernel; the kernel represents domain knowledge => projecting into some higher-dimensional space.
For data that is separable, but not linearly, we can use a kernel function to capture a nonlinear dividing curve. The kernel function should capture some aspect of similarity in our data.
Kernel Machines:
Do not remember the entire populations (positive set / negative set).
Just remember the instances supporting the boundary. This works well for Recommender Systems.
Ref: https://www.quora.com/What-are-Kernels-in-Machine-Learning-and-SVM
Simple example of a kernel: x = (x1, x2, x3); y = (y1, y2, y3). Then for the mapping f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y) = (<x, y>)^2.
Let's plug in some numbers to make this more intuitive:
suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
A lot of algebra, as f is a mapping from 3-dimensional to 9-dimensional space.
Now let us use the kernel instead:
K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024. Same result, but this calculation is so much easier.
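A quick numeric check of the identity above, assuming NumPy; it reproduces both sides of the calculation:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

f = lambda v: np.outer(v, v).ravel()      # explicit mapping to 9 dimensions
print(f(x) @ f(y))                        # 1024, the long way
print((x @ y) ** 2)                       # 1024, via the kernel K(x, y) = (<x, y>)^2
```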
Notes on Apriori
http://software.ucv.ro/~cmihaescu/ro/teaching/AIR/docs/Lab8-Apriori.pdf
https://youtu.be/4J3gX4ySw1s?t=10
The problem of association rule mining is defined as:
Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X → Y where X, Y ⊆ I and X ∩ Y = ∅.
Let's use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}.
Transaction ID | milk | bread | butter | beer
1              | 1    | 1     | 0      | 0
2              | 0    | 1     | 1      | 0
...
supp(X) = number of transactions which contain the itemset X / total number of transactions
Say the itemset {milk, bread, butter} has a support of 4/15 = 0.26.
conf(X → Y) = supp(X ∪ Y) / supp(X)
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65
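A minimal sketch of computing support and confidence for the rule {milk, bread} => {butter}; the five toy transactions are invented, so the numbers differ from the example above:

```python
transactions = [
    {'milk', 'bread'}, {'bread', 'butter'}, {'milk', 'bread', 'butter'},
    {'milk', 'bread', 'beer'}, {'butter', 'beer'},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {'milk', 'bread'}, {'butter'}
print(support(X | Y))                 # supp(X ∪ Y)
print(support(X | Y) / support(X))    # conf(X -> Y)
```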
Types of Errors
In sample error => error resulted from applying the prediction algorithm to the training dataset
Out of sample error => error resulted from applying the prediction algorithm to a new test data set
In sample error much lower than out of sample error => the model is overfitting, i.e. the model is too optimized for the initial dataset
Regression Errors:
Bias-Variance Estimates
It is very important to calculate 'Bias Errors' and 'Variance Errors' while comparing various algorithms.
Error due to Bias => when a prediction model is built multiple times, the Bias Error is the difference between the Expected Prediction value and the Correct value. Bias describes how far the prediction range deviates from the real values.
Example of low bias ==> the mean of all the sample predictions tends to converge towards the mean of the real values.
Error due to Variance => how much the predictions for a given point vary between different implementations of the model.
Example of high variance ==> sample predictions tend to be dispersed away from each other.
Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
So it is often better to give up a little accuracy for more robustness when predicting on new data.
Classification Errors:
Positive = identified and Negative = rejected
True positive = correctly identified (predicted true when true)
False positive = incorrectly identified (predicted true when false)
True negative = correctly rejected (predicted false when false)
False negative = incorrectly rejected (predicted false when true)
example: medical testing
True positive = Sick people correctly diagnosed as sick
False positive = Healthy people incorrectly identified as sick
True negative = Healthy people correctly identified as healthy
False negative = Sick people incorrectly identified as healthy
Reference: https://www.youtube.com/watch?v=cYls8WVZfyc
Cohen's kappa: k = (accuracy - P(e)) / (1 - P(e))
P(e) = ((TP+FP)/total) × ((TP+FN)/total) + ((TN+FN)/total) × ((FP+TN)/total)
What the customer cares about are the Type-1 errors (FP) and Type-2 errors (FN).
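A sketch of computing P(e) and the kappa statistic from confusion-matrix counts; the counts below are made up for illustration:

```python
TP, FP, TN, FN = 40, 10, 45, 5
total = TP + FP + TN + FN

accuracy = (TP + TN) / total
p_e = ((TP + FP) / total) * ((TP + FN) / total) + ((TN + FN) / total) * ((FP + TN) / total)
kappa = (accuracy - p_e) / (1 - p_e)
print(accuracy, p_e, kappa)
```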
Hyper parameter Optimization :
Choose a Regularization lambda that increases performance and decreases loss
Use Bayesian Optimization instead of Grid Search
Receiver Operating Characteristic curves:
>> demonstrate the predictive power of a model across various thresholds
cons: (i) class imbalance can make them misleading, (ii) they ignore the relative costs of FP vs. FN
x-axis = 1 - specificity (or, probability of false positive)
y-axis = sensitivity (or, probability of true positive)
area under the curve = quantifies whether the prediction model is viable or not, i.e. higher area → better predictor
area = 0.5 → effectively random guessing (the diagonal line in the ROC curve)
area = 1 → perfect classifier
area = 0.8 → considered good for a prediction algorithm
Choose the cross entropy function as the Logistic Loss:
loss = -Σi [ yi · log(ŷi) + (1 - yi) · log(1 - ŷi) ]
So, as we see, if the predicted probability is close to 0 for a 'yes' example OR close to 1 for a 'no' example, then the loss value ~ -log(~0) ~ +Infinity: it outputs a very large value (a high penalty).
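A sketch of this binary cross-entropy (logistic) loss, assuming NumPy; note the large penalty when a confident prediction is wrong. The probability values are illustrative:

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1])
print(log_loss(y_true, np.array([0.9, 0.1, 0.8])))    # small loss
print(log_loss(y_true, np.array([0.01, 0.99, 0.8])))  # confident and wrong: huge loss
```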
The Precision-Recall curve solves the problem of imbalance!
Let's see which AUC curve makes more business sense:
As we see, the orange curve (with the highest AUC) doesn't satisfy the constraint and isn't profitable!
Now that we have a model which is economically viable, let's pick the right classifier threshold.
Model Optimization Strategies:
Besides cross validation and hyper-parameter tuning, tie up long-term Business Metrics with the Response Variable.
Know that selecting the wrong model, and the penalty of misclassification, has an economic impact (say, Money Transaction / Credit Allocation).
An ML algorithm can inherently hide issues:
- if features are implemented incorrectly
- random variations in production data
So Interpretation and Evaluation of the Model is very important.
First Interpret Models
> understand how Features contribute to Model Predictions (build confidence)
> explain Individual Predictions
> evaluate Consistency and Coherence of the Model
ML-Insights (Python package)
> quick Feature Dependence Plots (ICE plots) for model exploration / understanding
> Feature Effect Summary
> Explain Feature Prediction (given 2 points, explain why the model produced a better prediction for one point over the other)
detailed reference: http://www.slideshare.net/SessionsEvents/brian-lucena-senior-data-scientist-metis-at-mlconf-sf-2016
https://www.youtube.com/watch?v=Vv157uQrgg4&list=PLrbAIdPI69Pi88waiIv8gZ3agEU_hBaVM&index=9
excerpts:
** here we see that if the Glucose Value is fixed to a specified value (keeping other variables constant), we can observe how the RISK factor changes for different patients
code ~ https://github.com/numeristical/introspective/tree/master/examples
Next Evaluate Model performances
— always include Costs in Model Selection
— always Review Model Evaluation metrics
Optimization Algorithms - adopt local methods
- Stochastic Gradient Descent, Conjugate Gradient
- Embarrassingly parallel
- Can get stuck in local minima
Mathematical optimization
- can find the global optimum
- nicely handles constraints (L0 norm)
Examples of Mathematical Optimization Models used in ML algorithms:
>> Linear Models: LASSO, Ridge Classifier, Elastic Net, Hinge Loss
>> SVM: Primal, Dual Linear
>> Decision Forests: Decision Tree Vote
>> Alternating Least Squares: application to Collaborative Filtering
Ref: http://www.slideshare.net/SessionsEvents/jeanfranois-puget-distinguished-engineer-machine-learning-and-optimization-ibm-at-mlconf-sf-2016
Interesting Documents:
Causal Analysis of Observational Data: https://www.youtube.com/watch?v=X2j6QT4UDSs&list=PLrbAIdPI69Pi88waiIv8gZ3agEU_hBaVM&index=12
References:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/
http://www.stat.cmu.edu/~cshalizi/350/
http://www.quora.com/Machine-Learning/What-are-some-good-resources-for-learning-about-machine-learning-Why
https://www.udacity.com/course/machine-learning--ud262
https://www.coursera.org/learn/machine-learning
http://sux13.github.io/DataScienceSpCourseNotes/8_PREDMACHLEARN/Practical_Machine_Learning_Course_Notes.html
https://www.youtube.com/watch?v=oxWruJZ-BbU
Ad

More Related Content

What's hot (20)

Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
swapnac12
 
L06 stemmer and edit distance
L06 stemmer and edit distanceL06 stemmer and edit distance
L06 stemmer and edit distance
ananth
 
3_learning.ppt
3_learning.ppt3_learning.ppt
3_learning.ppt
butest
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Agile Testing Alliance
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
K Hari Shankar
 
Citython presentation
Citython presentationCitython presentation
Citython presentation
Ankit Tewari
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning Overview
Mykhailo Koval
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
ananth
 
ppt
pptppt
ppt
butest
 
Generative Adversarial Networks : Basic architecture and variants
Generative Adversarial Networks : Basic architecture and variantsGenerative Adversarial Networks : Basic architecture and variants
Generative Adversarial Networks : Basic architecture and variants
ananth
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
butest
 
Function Approx2009
Function Approx2009Function Approx2009
Function Approx2009
Imthias Ahamed
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
recsysfr
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Rikiya Takahashi
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3
Xueping Peng
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it works
Kevin Lee
 
Fcv hum mach_geman
Fcv hum mach_gemanFcv hum mach_geman
Fcv hum mach_geman
zukun
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
butest
 
Machine learning
Machine learningMachine learning
Machine learning
Sukhwinder Singh
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
ankit_ppt
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
swapnac12
 
L06 stemmer and edit distance
L06 stemmer and edit distanceL06 stemmer and edit distance
L06 stemmer and edit distance
ananth
 
3_learning.ppt
3_learning.ppt3_learning.ppt
3_learning.ppt
butest
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Agile Testing Alliance
 
Advance data structure & algorithm
Advance data structure & algorithmAdvance data structure & algorithm
Advance data structure & algorithm
K Hari Shankar
 
Citython presentation
Citython presentationCitython presentation
Citython presentation
Ankit Tewari
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning Overview
Mykhailo Koval
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
ananth
 
Generative Adversarial Networks : Basic architecture and variants
Generative Adversarial Networks : Basic architecture and variantsGenerative Adversarial Networks : Basic architecture and variants
Generative Adversarial Networks : Basic architecture and variants
ananth
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
butest
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
recsysfr
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Rikiya Takahashi
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3
Xueping Peng
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it works
Kevin Lee
 
Fcv hum mach_geman
Fcv hum mach_gemanFcv hum mach_geman
Fcv hum mach_geman
zukun
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
butest
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
ankit_ppt
 

Viewers also liked (20)

TPM
TPMTPM
TPM
Abu Xafar Ansari Bin Kayes Akik
 
C.V Saleh Hijjah
C.V Saleh HijjahC.V Saleh Hijjah
C.V Saleh Hijjah
Saleh Hijjah
 
Herramientas de software
Herramientas de softwareHerramientas de software
Herramientas de software
Johana Garcia
 
LEAN Manufacturing
LEAN ManufacturingLEAN Manufacturing
LEAN Manufacturing
Abu Xafar Ansari Bin Kayes Akik
 
Kierunki rozwoju BPM
Kierunki rozwoju BPMKierunki rozwoju BPM
Kierunki rozwoju BPM
Tomasz Gzik
 
Surviving the blackout
Surviving the blackoutSurviving the blackout
Surviving the blackout
David Morley
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
butest
 
Machine Learning by Analogy
Machine Learning by AnalogyMachine Learning by Analogy
Machine Learning by Analogy
Colleen Farrelly
 
Indexes don't mean slow inserts.
Indexes don't mean slow inserts.Indexes don't mean slow inserts.
Indexes don't mean slow inserts.
Anastasia Lubennikova
 
Predictive Analytics Glossary
Predictive Analytics GlossaryPredictive Analytics Glossary
Predictive Analytics Glossary
Algolytics
 
Global positioning system (gps)
Global positioning system (gps)Global positioning system (gps)
Global positioning system (gps)
aditya singh
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
Gregory Piatetsky-Shapiro
 
Direct steam generation from solar
Direct steam generation from solarDirect steam generation from solar
Direct steam generation from solar
Akshay ss kumar
 
Educación y género
Educación y género Educación y género
Educación y género
RACHELMAYRA
 
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
H2O World - Survey of Available Machine Learning Frameworks - Brendan HergerH2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
Sri Ambati
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithms
Arunangsu Sahu
 
Machine learning-cheat-sheet
Machine learning-cheat-sheetMachine learning-cheat-sheet
Machine learning-cheat-sheet
Willy Marroquin (WillyDevNET)
 
How to hack into the big data team
How to hack into the big data teamHow to hack into the big data team
How to hack into the big data team
Data Science Thailand
 
Foaming capacity of different soaps
Foaming capacity of different soapsFoaming capacity of different soaps
Foaming capacity of different soaps
aditya singh
 
Herramientas de software
Herramientas de softwareHerramientas de software
Herramientas de software
Johana Garcia
 
Kierunki rozwoju BPM
Kierunki rozwoju BPMKierunki rozwoju BPM
Kierunki rozwoju BPM
Tomasz Gzik
 
Surviving the blackout
Surviving the blackoutSurviving the blackout
Surviving the blackout
David Morley
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
butest
 
Machine Learning by Analogy
Machine Learning by AnalogyMachine Learning by Analogy
Machine Learning by Analogy
Colleen Farrelly
 
Predictive Analytics Glossary
Predictive Analytics GlossaryPredictive Analytics Glossary
Predictive Analytics Glossary
Algolytics
 
Global positioning system (gps)
Global positioning system (gps)Global positioning system (gps)
Global positioning system (gps)
aditya singh
 
Direct steam generation from solar
Direct steam generation from solarDirect steam generation from solar
Direct steam generation from solar
Akshay ss kumar
 
Educación y género
Educación y género Educación y género
Educación y género
RACHELMAYRA
 
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
H2O World - Survey of Available Machine Learning Frameworks - Brendan HergerH2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
Sri Ambati
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithms
Arunangsu Sahu
 
Foaming capacity of different soaps
Foaming capacity of different soapsFoaming capacity of different soaps
Foaming capacity of different soaps
aditya singh
 
Ad

Similar to MS CS - Selecting Machine Learning Algorithm (20)

Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1
Kaniska Mandal
 
Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3
butest
 
Deep learning from mashine learning AI..
Deep learning from mashine learning AI..Deep learning from mashine learning AI..
Deep learning from mashine learning AI..
premkumarlive
 
Pattern Recognition and understanding patterns
Pattern Recognition and understanding patternsPattern Recognition and understanding patterns
Pattern Recognition and understanding patterns
gulhanep9
 
Pattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture NotesPattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture Notes
Akshaya821957
 
Image Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptxImage Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptx
ssuseracb8ba
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Universitat Politècnica de Catalunya
 
nnml.ppt
nnml.pptnnml.ppt
nnml.ppt
yang947066
 
Decision tree
Decision treeDecision tree
Decision tree
Ami_Surati
 
Decision tree
Decision treeDecision tree
Decision tree
Soujanya V
 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptx
EmanAl15
 
Machine learning
Machine learningMachine learning
Machine learning
Digvijay Singh
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
 
Decision tree learning
Decision tree learningDecision tree learning
Decision tree learning
Dr. Radhey Shyam
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
Oswald Campesato
 
supervised.pptx
supervised.pptxsupervised.pptx
supervised.pptx
MohamedSaied316569
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
butest
 
Summary.ppt
Summary.pptSummary.ppt
Summary.ppt
butest
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos
butest
 
Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1
Kaniska Mandal
 
Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3
butest
 
Deep learning from mashine learning AI..
Deep learning from mashine learning AI..Deep learning from mashine learning AI..
Deep learning from mashine learning AI..
premkumarlive
 
Pattern Recognition and understanding patterns
Pattern Recognition and understanding patternsPattern Recognition and understanding patterns
Pattern Recognition and understanding patterns
gulhanep9
 
Pattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture NotesPattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture Notes
Akshaya821957
 
Image Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptxImage Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptx
ssuseracb8ba
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Universitat Politècnica de Catalunya
 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptx
EmanAl15
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
Oswald Campesato
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
butest
 
Summary.ppt
Summary.pptSummary.ppt
Summary.ppt
butest
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos
butest
 
Ad

More from Kaniska Mandal (20)

Machine learning advanced applications
Machine learning advanced applicationsMachine learning advanced applications
Machine learning advanced applications
Kaniska Mandal
 
Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data Analytics
Kaniska Mandal
 
Debugging over tcp and http
Debugging over tcp and httpDebugging over tcp and http
Debugging over tcp and http
Kaniska Mandal
 
Designing Better API
Designing Better APIDesigning Better API
Designing Better API
Kaniska Mandal
 
Concurrency Learning From Jdk Source
Concurrency Learning From Jdk SourceConcurrency Learning From Jdk Source
Concurrency Learning From Jdk Source
Kaniska Mandal
 
Wondeland Of Modelling
Wondeland Of ModellingWondeland Of Modelling
Wondeland Of Modelling
Kaniska Mandal
 
The Road To Openness.Odt
The Road To Openness.OdtThe Road To Openness.Odt
The Road To Openness.Odt
Kaniska Mandal
 
Perils Of Url Class Loader
Perils Of Url Class LoaderPerils Of Url Class Loader
Perils Of Url Class Loader
Kaniska Mandal
 
Making Applications Work Together In Eclipse
Making Applications Work Together In EclipseMaking Applications Work Together In Eclipse
Making Applications Work Together In Eclipse
Kaniska Mandal
 
Eclipse Tricks
Eclipse TricksEclipse Tricks
Eclipse Tricks
Kaniska Mandal
 
E4 Eclipse Super Force
E4 Eclipse Super ForceE4 Eclipse Super Force
E4 Eclipse Super Force
Kaniska Mandal
 
Create a Customized GMF DnD Framework
Create a Customized GMF DnD FrameworkCreate a Customized GMF DnD Framework
Create a Customized GMF DnD Framework
Kaniska Mandal
 
Creating A Language Editor Using Dltk
Creating A Language Editor Using DltkCreating A Language Editor Using Dltk
Creating A Language Editor Using Dltk
Kaniska Mandal
 
Advanced Hibernate Notes
Advanced Hibernate NotesAdvanced Hibernate Notes
Advanced Hibernate Notes
Kaniska Mandal
 
Best Of Jdk 7
Best Of Jdk 7Best Of Jdk 7
Best Of Jdk 7
Kaniska Mandal
 
Converting Db Schema Into Uml Classes
Converting Db Schema Into Uml ClassesConverting Db Schema Into Uml Classes
Converting Db Schema Into Uml Classes
Kaniska Mandal
 
EMF Tips n Tricks
EMF Tips n TricksEMF Tips n Tricks
EMF Tips n Tricks
Kaniska Mandal
 
Graphical Model Transformation Framework
Graphical Model Transformation FrameworkGraphical Model Transformation Framework
Graphical Model Transformation Framework
Kaniska Mandal
 
Mashup Magic
Mashup MagicMashup Magic
Mashup Magic
Kaniska Mandal
 
Protocol For Streaming Media
Protocol For Streaming MediaProtocol For Streaming Media
Protocol For Streaming Media
Kaniska Mandal
 
Machine learning advanced applications
Machine learning advanced applicationsMachine learning advanced applications
Machine learning advanced applications
Kaniska Mandal
 
Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data Analytics
Kaniska Mandal
 
Debugging over tcp and http
Debugging over tcp and httpDebugging over tcp and http
Debugging over tcp and http
Kaniska Mandal
 
Concurrency Learning From Jdk Source
Concurrency Learning From Jdk SourceConcurrency Learning From Jdk Source
Concurrency Learning From Jdk Source
Kaniska Mandal
 
Wondeland Of Modelling
Wondeland Of ModellingWondeland Of Modelling
Wondeland Of Modelling
Kaniska Mandal
 
The Road To Openness.Odt
The Road To Openness.OdtThe Road To Openness.Odt
The Road To Openness.Odt
Kaniska Mandal
 
Perils Of Url Class Loader
Perils Of Url Class LoaderPerils Of Url Class Loader
Perils Of Url Class Loader
Kaniska Mandal
 
Making Applications Work Together In Eclipse
Making Applications Work Together In EclipseMaking Applications Work Together In Eclipse
Making Applications Work Together In Eclipse
Kaniska Mandal
 
E4 Eclipse Super Force
E4 Eclipse Super ForceE4 Eclipse Super Force
E4 Eclipse Super Force
Kaniska Mandal
 
Create a Customized GMF DnD Framework
Create a Customized GMF DnD FrameworkCreate a Customized GMF DnD Framework
Create a Customized GMF DnD Framework
Kaniska Mandal
 
Creating A Language Editor Using Dltk
Creating A Language Editor Using DltkCreating A Language Editor Using Dltk
Creating A Language Editor Using Dltk
Kaniska Mandal
 
Advanced Hibernate Notes
Advanced Hibernate NotesAdvanced Hibernate Notes
Advanced Hibernate Notes
Kaniska Mandal
 
Converting Db Schema Into Uml Classes
Converting Db Schema Into Uml ClassesConverting Db Schema Into Uml Classes
Converting Db Schema Into Uml Classes
Kaniska Mandal
 
Graphical Model Transformation Framework
Graphical Model Transformation FrameworkGraphical Model Transformation Framework
Graphical Model Transformation Framework
Kaniska Mandal
 
Protocol For Streaming Media
Protocol For Streaming MediaProtocol For Streaming Media
Protocol For Streaming Media
Kaniska Mandal
 

Recently uploaded (20)

Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Deloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining ProjectsDeloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining Projects
Process mining Evangelist
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Deloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining ProjectsDeloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining Projects
Process mining Evangelist
 
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 

MS CS - Selecting Machine Learning Algorithm

  • 3. Lazy Learners: lazy learners do not compute a function to fit the training data before new data is received, so we save significant time upfront; new instances are compared to the training data to make a classification / regression decision. This considers local-scale estimation.

The following slides compare algorithms along six dimensions: ML Algo | Preference Bias | Learning Function | Performance | Enhancements | Usage.

Algo: Bayesian Learning (Eager Learner) - Classification

Preference Bias:
- Prior domain knowledge: Pr(h) is the prior probability of each candidate hypothesis h; Pr(D) is the probability distribution over the observed data for each h.
- Occam's Razor: select the h with minimum description length.
- There is at least one maximally probable hypothesis: h_MAP = argmax_h P(h|D), which reduces to argmax_h P(D|h) for a uniform prior.

Learning Function:
- Posterior probability: P(h|D) = P(D|h) P(h) / P(D)
- Key assumption: every h_i is equally probable a priori, i.e. P(h_i) = P(h_j).
- Noise-free data, uniformly distributed hypotheses: P(h) = 1/|H|; P(D|h) = 1 if d_i = h(x_i) for every i, else 0; so P(h|D) = 1/|VS| for hypotheses in the version space VS.
- Noisy data (e.g. d_i = k·x_i plus noise, d_i = f(x_i) + e): h_ML = argmax_h P(D|h) = argmax_h Π_i P(d_i|h), which (taking logs) reduces to argmin_h Σ_i (d_i − h(x_i))².
- Bayes optimal classification: v_MAP = argmax_v Σ_h P(v|h) P(h|D).

Performance:
Cons:
- significant computational cost to find the Bayes optimal hypothesis
- sometimes a huge number of hypotheses needs to be surveyed
Pros:
- Naive Bayes handles missing data very well: it simply excludes the attribute with missing data when computing the posterior probability (the probability of the class given the data point).
- no need to enumerate every hypothesis explicitly; for smaller training sets, NB is a good bet.

Enhancements:
- Use Bayesian networks to represent conditional independence between variables.
- NB assumes real-valued attributes are normally distributed; as a result, NB can only have linear, elliptic, or parabolic decision boundaries.
- (Typical error sources: misclassification, pruning, fitting errors.)

Usage:
- Spam filtering with a naive Bayes structure (class: spam; attributes: Lottery, Bank, College):
  P(spam | lottery, not bank, not college) ∝ P(v) · Π_i P(a_i | v)
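As a concrete illustration of the naive Bayes posterior in the spam example above, here is a minimal Python sketch; the prior and the per-attribute likelihoods are hypothetical numbers, not values taken from the slides.

```python
# Minimal sketch: naive Bayes posterior P(spam | a1..an) ∝ P(spam) * Π_i P(a_i | spam),
# with binary attributes (word present / absent). All probabilities below are assumptions.

def naive_bayes_score(prior, likelihoods, evidence):
    """prior: P(class); likelihoods: P(attr=1 | class) per attribute; evidence: observed 0/1 values."""
    score = prior
    for p_attr, observed in zip(likelihoods, evidence):
        score *= p_attr if observed else (1.0 - p_attr)
    return score

# Hypothetical probabilities, as if estimated from training mail counts.
p_spam, p_ham = 0.4, 0.6
p_attr_given_spam = [0.7, 0.1, 0.05]   # P(lottery|spam), P(bank|spam), P(college|spam)
p_attr_given_ham  = [0.01, 0.3, 0.2]

evidence = [1, 0, 0]  # lottery present, bank absent, college absent
spam_score = naive_bayes_score(p_spam, p_attr_given_spam, evidence)
ham_score  = naive_bayes_score(p_ham,  p_attr_given_ham,  evidence)

# Normalizing gives the posterior; P(D) cancels in the argmax, but is shown for completeness.
print("P(spam | evidence) ~", spam_score / (spam_score + ham_score))
```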
  • 4. Algo: Decision Tree (Eager Learner) - ID3, C4.5. Approximates discrete-valued functions; represents a disjunction of conjunctions of constraints on attribute values.

Description:
- Classification for discrete input data, and for continuous input data (use a range condition as the split, e.g. > 20%).

Preference Bias:
- Occam's Razor: prefer the shorter tree.
- Other biases: prefer attributes with many possible values; prefer trees that place high-information-gain attributes close to the root (the attribute with the best answers, NOT the best splits).

Learning Function:
- InfoGain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)   (entropy minus the weighted sum of the entropies of the partitions)
- Entropy(S) = −Σ_v p_v log₂(p_v)   (a short code sketch of these two formulas appears after this slide)

Performance:
- The usual problem with decision trees: for N boolean variables there are 2^N combinations of rows and 2^(2^N) possible output functions, so instead of iterating over all rows, work first on the attributes with the highest information gain.
- Handles noise; handles missing values.
- Scope for improvement: decision trees often achieve lower generalization accuracy than other learning methods, such as support vector machines and neural networks; one common way to improve their accuracy is boosting.

Enhancements:
- Pros: computes the best attribute in one move.
- Cons: does not look ahead or behind (this is addressed by hill-climbing-style search); tends to overfit because it explores many different combinations of features (logistic regression avoids overfitting more elegantly).
- Overfitting solutions for decision trees: stop growing the tree before it grows too large; prune after a certain threshold.
- Consider interdependency between attributes, P(Y=y | X=x); consider GainRatio and SplitInfo.

Usage:
- Restaurant selection decision based on cost, menu, appetite, weather, and other features.
- Decision Tree Regression (for continuous output data), with a lazy distance-based learning function: for each training sample s_l and stored sample S_l, D_l = dist(s_l, S_l) = root-sum-square of the differences, and weight W_j = d_max − d_j.

Advantages of decision trees include:
- computational scalability
- handling of messy data: missing values, various feature types
- ability to deal with irrelevant features: the algorithm selects "relevant" features first and generally ignores irrelevant features
- if the decision tree is short, it is easy for a human to interpret: decision trees do not produce a black-box model.
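A minimal sketch of the entropy and information-gain formulas above, computed on a small hypothetical attribute/label list:

```python
import math
from collections import Counter

# Entropy(S) = -Σ p_v log2(p_v); InfoGain(S, A) = Entropy(S) - Σ_v (|S_v|/|S|) Entropy(S_v).

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(attribute_values, labels):
    total = len(labels)
    gain = entropy(labels)
    for value in set(attribute_values):
        subset = [lbl for attr, lbl in zip(attribute_values, labels) if attr == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy restaurant-style example: does "cheap?" predict "go?"
cheap = ["yes", "yes", "no", "no", "yes"]
go    = ["go",  "go",  "skip", "skip", "go"]
print(info_gain(cheap, go))   # ~0.97 bits: this attribute perfectly separates the labels
```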
  • 5. Algo: Linear Regression (Eager Learner). Models a linear relationship between a dependent variable (y) and independent variables (x1, x2, ...). Regression, as a term, stems from the observation that individual instances of any observed attribute tend to regress towards the mean.

Description:
- Scalar input, continuous output; or vector input, continuous output (vector input combines multiple features into a single prediction).

Preference Bias:
- Regress to the mean.
- Gradient: for one variable, the derivative is the slope of the tangent line; for several variables, the gradient is the direction of the fastest increase of the function.

Learning Function:
- ŷ = θ0 + θ1·x1 + θ2·x2, where y_i is the observed value.
- Minimize the sum of squared errors: J(θ) = ½ Σ_i (ŷ_i − y_i)².
- Gradient descent update: θ_next = θ_current − α∇J(θ), where α is the learning rate, so each step moves a small amount opposite to the direction of ∇J (the direction of fastest increase). A small sketch of this update follows this slide.

Performance:
- Cons: the function should be differentiable.
- Caution: the learning rate must be neither very small nor very large.

Usage:
- Housing price prediction; polynomial regression.
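A minimal sketch of batch gradient descent for a one-feature linear model, minimizing ½ Σ (ŷ − y)²; the data and the learning rate are hypothetical choices for illustration.

```python
# θ_next = θ_current - α ∇J(θ), applied to ŷ = θ0 + θ1·x.

def gradient_descent(xs, ys, alpha=0.01, steps=5000):
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [theta0 + theta1 * x for x in xs]
        grad0 = sum(p - y for p, y in zip(preds, ys))                # dJ/dθ0
        grad1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs))   # dJ/dθ1
        theta0 -= alpha * grad0 / n
        theta1 -= alpha * grad1 / n
    return theta0, theta1

# Toy housing-style data: size (in 1000 sqft) vs. price (in 100k units).
xs = [1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.0, 2.6, 3.1, 3.9, 4.4]
print(gradient_descent(xs, ys))   # prints the fitted (intercept, slope)
```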
  • 6. Algo: Multi-Layer Perceptron (Eager Learner) - Classification.

Preference Bias:
- Initial weights should be chosen to be small, random values: this helps avoid poor local minima and gives variability and low complexity (larger weights equate to larger complexity).

Learning Function:
- A perceptron is a linear function that defines a hyperplane in n dimensions, perpendicular to the weight vector w = (w1, w2, ..., wn). The perceptron classifies points on one side of the hyperplane as positive and points on the other side as negative.
- Perceptron rule: guarantees finite convergence, but only if the data is linearly separable: Δw_i = η(y − ŷ)x_i  (a short sketch of this rule follows this slide)
- Gradient descent rule: calculus-based; more robust to data sets that are not linearly separable, but converges only to a local minimum / optimum: Δw_i = η(y − a)x_i

Performance:
- Neural networks have low restriction bias, because they can model many different functions; therefore they are in danger of overfitting.
- Neural networks consist of: perceptrons (half-spaces); sigmoids instead of step functions (much more complex); hidden layers (groups of sigmoid units).
- This allows modeling many types of functions / behaviors: Boolean (a network of threshold-like units); continuous (through hidden layers, e.g. sigmoids instead of steps); arbitrary, non-continuous (multiple hidden layers).

Enhancements:
- Adding hidden layers helps map continuous functions (a change in input changes the output very smoothly).
- Multiply weights only if we don't get better errors.

Usage:
- One obvious advantage of artificial neural networks is the ability to produce any number of outputs (multi-class), while a support vector machine produces only one. The most direct way to create an n-ary classifier with support vector machines is to create n support vector machines and train them one by one; an n-ary classifier with neural networks can be trained in one go.
- A multi-layer perceptron is able to find relations between features. For example, this is necessary in computer vision, when a raw image is provided to the learning algorithm and sophisticated features must be computed: the intermediate layers can compute new, previously unknown features.
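A minimal sketch of the perceptron rule Δw_i = η(y − ŷ)x_i on a linearly separable toy problem; the data, learning rate, and epoch count are illustrative assumptions.

```python
# Single perceptron trained with the perceptron rule; converges because AND is separable.

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(samples, eta=0.1, epochs=20):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = predict(weights, bias, x)
            update = eta * (y - y_hat)          # zero when the sample is already correct
            weights = [w + update * xi for w, xi in zip(weights, x)]
            bias += update
    return weights, bias

samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]   # AND-like problem
w, b = train_perceptron(samples)
print([predict(w, b, x) for x, _ in samples])   # expect [0, 0, 0, 1]
```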
  • 7. Algo: K Nearest Neighbors - Classification. Remembers the mapping; fast lookup.

Preference Bias (why consider KNN over others?):
- Near points are similar to one another (locality).
- Smoothly changing behavior from one neighborhood to the next.
- We are free to choose the best distance function.

Learning Function:
- Choose the best distance function, e.g.
  Manhattan (ℓ1): d = |y2 − y1| + |x2 − x1|
  Euclidean (ℓ2): d = sqrt((y2 − y1)² + (x2 − x1)²)

Performance:
- Problem: the curse of dimensionality: as the number of features grows, the amount of data required for accurate generalization grows exponentially, O(2^d). Reducing feature weights helps curb the effect of dimensionality.
- When k is small, models have low bias but high variance, fitting on a strongly local level; a larger k creates smoother models with lower variance but higher bias.
- Cons: KNN doesn't know which attributes are more important; it doesn't handle missing data gracefully.

Enhancements:
- Generalization: no; overfitting: yes.

Usage:
- No assumption about the data distribution (a great advantage over NB); it is highly non-parametric.
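A minimal sketch of k-NN classification with the Euclidean distance defined above, on hypothetical 2-D points.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    # train: list of (point, label); take the majority label among the k nearest points.
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_predict(train, (2, 2)))   # "A": the query sits in the left cluster
```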
  • 8. Algo: K Nearest Neighbors - Regression, LWR (locally weighted regression).

Learning Function:
- Combines traditional regression with instance-based learning's sensitivity to training items that have high similarity to the test point.

Performance:
- Reduce the pull effect of far-away points through kernels.
- The squared deviations are weighted by a kernel function that decreases with distance, so that for a new test instance a regression function is found for that specific point, one that emphasizes fitting close-by points and ignores the pull of faraway points.
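A minimal sketch of the kernel-weighting idea above; note this is a simplified stand-in for full LWR (a weighted average rather than a weighted least-squares fit per query), and the data and bandwidth are hypothetical.

```python
import math

def kernel_weight(x_train, x_query, bandwidth=1.0):
    # Gaussian kernel: weight decays with distance from the query point.
    return math.exp(-((x_train - x_query) ** 2) / (2 * bandwidth ** 2))

def locally_weighted_predict(xs, ys, x_query, bandwidth=1.0):
    weights = [kernel_weight(x, x_query, bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 10.0]
ys = [0.1, 1.2, 1.9, 3.2, 50.0]   # the far-away point at x=10 barely pulls queries near x=2
print(locally_weighted_predict(xs, ys, 2.0))   # close to the local values around x=2
```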
  • 9. Algo: Ensemble Learning (Boosting). Solves the classification problem. Boosting is a meta-learning technique, i.e. something you put on top of a set of learners to form an ensemble.

Preference Bias:
- An individual rule (the result of learning over a subset of the data) does not provide the answer, but when the rules are combined, the complex combined rule works well.
- Choose examples where the combined rule offers better performance on testing subsets of the data than, say, fitting a 4th-order polynomial.

Learning Function:
- Error is measured under the current distribution: Pr_D[h(x) ≠ c(x)]; boosting "boosts up" (re-weights) the distribution.
- Example votes of three weak hypotheses on three examples:
       h1   h2   h3
  x1   +1   -1   +1
  x2   -1   -1   +1
  x3   +1   -1   +1
- At each time step t, find a hypothesis H_t with small error (a weak classifier), while constantly creating new distributions (boosting).
- Final hypothesis: the sgn (sign) function of the weighted sum of all of the rules.

Performance:
- Why does boosting do so well? If some samples do not give a good result, boosting can re-weight the samples so that some of the 'past under-performers' become more important.
- Use gradient boosting to handle noisy data in decision trees: https://en.wikipedia.org/wiki/Gradient_boosting
- Boosting does overfit if the weak learners are neural networks with many layers of nodes.
- Choosing subsets: instead of selecting subsets randomly, pick subsets containing the hardest examples, i.e. those that don't perform well under the current rule.
- Combine: instead of a plain mean, use a weighted mean.

Enhancements:
- Computationally efficient.
- No difficult parameters to set.
- Versatile: a wide range of base learners can be used with AdaBoost.
Caveats:
- The algorithm seems susceptible to uniform noise.
- The weak learner should not be too complex, to avoid overfitting.
- There needs to be enough data so that the weak-learning requirement is satisfied: the base learner should perform consistently better than random guessing, with generalization error < 0.5 for binary classification problems.

Usage:
- Spam rules, each weak on its own:
  body contains the word "manly" → YES
  from: your spouse → NO
  body has short length → YES
  body only contains URLs → YES
  body is just an image → YES
  body contains words belonging to a blacklist (misspellings) → YES
  All of these rules are useful; however, no specific one can determine spam (or not) on its own. We need to find a way to combine them.
- Find which Wiki pages can be recommended for an extended period of time (the feature set is a combination of binary, text, and numeric features).
- If you have a dense feature set, go with boosting.

Ref:
http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://media.nips.cc/Conferences/2007/Tutorials/Slides/schapire-NIPS-07-tutorial.pdf
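A minimal sketch of the boosting loop described above: repeatedly fit a weak rule (here a 1-D threshold stump), up-weight the examples it gets wrong, and combine the rules by a weighted vote. This follows the standard AdaBoost update, which is spelled out on a later slide; the data and number of rounds are hypothetical.

```python
import math

def fit_stump(xs, ys, w):
    # Pick the (threshold, polarity) with the smallest weighted error.
    best = None
    for thr in xs:
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y) / sum(w)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = fit_stump(xs, ys, w)
        err = max(err, 1e-10)                       # avoid division by zero
        alpha = math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Increase the weights of misclassified examples.
        w = [wi * math.exp(alpha) if (pol if x >= thr else -pol) != y else wi
             for wi, x, y in zip(w, xs, ys)]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, -1, -1, -1, 1, 1, 1]     # not separable by any single threshold rule
print([predict(adaboost(xs, ys), x) for x in xs])
```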
  • 10. Notes on Ensemble Learning (Boosting)

An important difference between ensemble learners and other types of learners:
- A neural network already knows its network structure and tries to learn the weights.
- A decision tree gradually builds its rules.
- An ensemble learner, instead, finds the best combination of rules.

Algo: SVM - Classification.

Preference Bias:
- Support: the goal of the support vector machine is to maximize the margin m, subject to the constraint that we classify everything correctly. Together, this can be written as: max(m) subject to y_i(w^T x_i + b) ≥ 1 for all i.

Learning Function:
- Find the line of least commitment in the linearly separable set of data; this is the basis behind support vector machines: a line that leaves as much space as possible from the boundaries.
- y = w^T x_j + b, where y is the classification label, y ∈ {−1, +1} (in class for y > 0, out of class for y < 0), and w and b are the parameters of the plane.
- The classifier output is greater than or equal to 1 for the positive examples and less than or equal to −1 for the negative examples; the margin is obtained by projecting the difference between two boundary vectors x1 and x2 onto the unit normal of the plane.

Performance:
- Similar to KNN, but instead of being completely lazy, SVM spends upfront effort solving a complicated quadratic program so that only the required points (the support vectors) are kept.
- For classification tasks involving more than two groups, a common strategy is to use multiple binary classifiers to decide on a single best class for new instances.

Enhancements:
- y = w·φ(x) + b: use a kernel when the feature vector φ is of higher dimension. Many machine learning algorithms can be written to use only dot products, and then we can replace the dot products with kernels.

Usage:
- Mostly binary classification (linear and non-linear).
- 1) If you have a sparse feature set, go with a linear SVM (or another linear model). 2) If you don't care about speed and memory, try a kernel SVM.
- To eliminate expensive parameter tuning and better handle a high-dimensional input space, we can use a kernelized SVM for text classification (tens of thousands of support vectors, each having hundreds of thousands of features).
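A minimal sketch (assuming scikit-learn is available) contrasting a linear SVM with an RBF-kernel SVM on a toy problem that is not linearly separable; it illustrates the "replace dot products with kernels" idea and the support vectors that the model keeps.

```python
from sklearn.svm import SVC

# XOR-like data: no single hyperplane separates the two classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 1, 0, 0, 0, 1, 1]

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear SVM training accuracy:", linear_svm.score(X, y))   # struggles on XOR
print("RBF SVM training accuracy:   ", kernel_svm.score(X, y))   # typically fits it
print("support vectors kept by the RBF SVM:", len(kernel_svm.support_))
```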
  • 11. AdaBoost:
1. Initialize the importance weights w_i = 1/N for all training examples i.
2. For m = 1 to M:
   a) Fit a classifier G_m(x) to the training data using the weights w_i.
   b) Compute the error: err_m = Σ_i w_i · I(y_i ≠ G_m(x_i)) / Σ_i w_i
   c) Compute α_m = log((1 − err_m) / err_m)
   d) Update the weights: w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))] for i = 1, 2, ..., N
3. Return G(x) = sign[Σ_m α_m G_m(x)].
We can see that for error < 0.5, the α_m parameter is positive.

Notes on Support Vector Machines - SVM
>> Here, instead of polynomial regression, we consider a polynomial kernel: the kernel represents domain knowledge => projecting into some higher-dimensional space.
>> For data that is separable, but not linearly, we can use a kernel function to capture a nonlinear dividing curve. The kernel function should capture some aspect of similarity in our data.
>> Kernel machines do not remember the entire population (positive set / negative set); they just remember the instances supporting the boundary. Works well for recommender systems.
  • 12. Ref: https://www.quora.com/What-are-Kernels-in-Machine-Learning-and-SVM

Simple example of a kernel: let x = (x1, x2, x3) and y = (y1, y2, y3). For the feature map f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the corresponding kernel is K(x, y) = (<x, y>)².
Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3) and y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
That is a lot of algebra, because f maps from 3-dimensional to 9-dimensional space. Now let us use the kernel instead: K(x, y) = (4 + 10 + 18)² = 32² = 1024. Same result, but the calculation is much easier.

Notes on Apriori
http://software.ucv.ro/~cmihaescu/ro/teaching/AIR/docs/Lab8-Apriori.pdf
https://youtu.be/4J3gX4ySw1s?t=10

The problem of association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X → Y where X, Y ⊆ I and X ∩ Y = ∅.

Let's use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and the transactions look like:

Transaction ID | milk | bread | butter | beer
1              | 1    | 1     | 0      | 0
2              | 0    | 1     | 1      | 0
...

supp(X) = (number of transactions which contain the itemset X) / (total number of transactions)
Say the itemset {milk, bread, butter} has a support of 4/15 ≈ 0.26.
conf(X → Y) = supp(X ∪ Y) / supp(X)
For the rule {milk, bread} => {butter} we have the following confidence:
  • 13. conf = supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65

Types of Errors
In-sample error => the error from applying the prediction algorithm to the training dataset.
Out-of-sample error => the error from applying the prediction algorithm to a new test dataset.
In-sample error much lower than out-of-sample error => the model is overfitting, i.e. the model is too optimized for the initial dataset.

Regression Errors: Bias-Variance Estimates
It is very important to estimate 'bias errors' and 'variance errors' while comparing various algorithms.
Error due to bias => when a prediction model is built multiple times, the bias error is the difference between the expected prediction value and the correct value. Bias measures how far the range of predictions deviates from the real values. Example of low bias: the mean of all the sample predictions tends to converge towards the mean of the real values.
Error due to variance => how much the predictions for a given point vary between different realizations of the model. Example of high variance: the sample predictions tend to be dispersed away from each other.
Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
So it is often better to give up a little accuracy for more robustness when predicting on new data.
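A minimal simulation sketch of the bias/variance idea above: train the same model on many resampled training sets and inspect its predictions at a single query point. The data-generating process and the two models are hypothetical choices for illustration.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 4.0                      # f(x) = x^2 evaluated at the query point x = 2

def sample_training_set(n=20):
    xs = [random.uniform(0, 3) for _ in range(n)]
    ys = [x ** 2 + random.gauss(0, 1) for x in xs]       # noisy observations of f(x) = x^2
    return xs, ys

def constant_model(xs, ys):
    mean_y = statistics.mean(ys)                          # very rigid model
    return lambda x: mean_y

def linear_model(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x                # more flexible model

for name, fit in (("constant", constant_model), ("linear", linear_model)):
    preds = []
    for _ in range(500):
        xs, ys = sample_training_set()
        preds.append(fit(xs, ys)(2.0))
    bias = statistics.mean(preds) - TRUE_VALUE            # deviation of the average prediction
    variance = statistics.variance(preds)                 # spread of the predictions
    print(name, "bias:", round(bias, 2), "variance:", round(variance, 2))
```

The rigid constant model shows the larger bias at the query point; the more flexible linear model reduces it.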
  • 14. Classification Errors:
Positive = identified, Negative = rejected.
True positive = correctly identified (predicted true when true)
False positive = incorrectly identified (predicted true when false)
True negative = correctly rejected (predicted false when false)
False negative = incorrectly rejected (predicted false when true)

Example: medical testing.
True positive = sick people correctly diagnosed as sick
False positive = healthy people incorrectly identified as sick
True negative = healthy people correctly identified as healthy
False negative = sick people incorrectly identified as healthy
Reference: https://www.youtube.com/watch?v=cYls8WVZfyc

Cohen's kappa:
k = (accuracy − P(e)) / (1 − P(e))
P(e) = ((TP+FP)/total) × ((TP+FN)/total) + ((TN+FN)/total) × ((FP+TN)/total)

What the customer cares about are Type-1 errors (FP) and Type-2 errors (FN).

Hyper-parameter optimization:
- Choose a regularization lambda that increases performance and decreases loss.
- Use Bayesian optimization instead of grid search.
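A minimal sketch of the confusion-matrix quantities and the kappa statistic defined above; the counts are hypothetical.

```python
def kappa(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Chance agreement: P(both say positive) + P(both say negative).
    p_e = ((tp + fp) / total) * ((tp + fn) / total) + ((tn + fn) / total) * ((fp + tn) / total)
    return (accuracy - p_e) / (1 - p_e)

print(kappa(tp=40, fp=10, tn=45, fn=5))   # 0.7: agreement well above chance
```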
  • 15. Receiver Operating Characteristic (ROC) curves:
>> demonstrate the predictive power of a model across various thresholds
cons: (i) sensitive to class imbalance, (ii) ignore the relative costs of FP vs. FN
x-axis = 1 − specificity (i.e. probability of a false positive)
y-axis = sensitivity (i.e. probability of a true positive)
Area under the curve (AUC) quantifies whether the prediction model is viable, i.e. a higher area → a better predictor:
area = 0.5 → effectively random guessing (the diagonal line in the ROC curve)
area = 1 → perfect classifier
area = 0.8 → considered good for a prediction algorithm

Choose the cross-entropy function as the logistic loss:
loss = −Σ_i [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]
  • 16. So, as we can see, if the predicted probability is close to 0 for a 'yes' example, or close to 1 for a 'no' example, then the loss value ≈ −log(≈0) → +∞, i.e. the loss outputs a very large value (a high penalty).
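A minimal sketch of the cross-entropy (logistic) loss above, showing the large penalty when a confident prediction is wrong; the probabilities are hypothetical.

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)            # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(log_loss([1, 0], [0.9, 0.1]))    # confident and correct -> small loss
print(log_loss([1, 0], [0.01, 0.99]))  # confident but wrong   -> very large loss
```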
  • 17. The precision-recall curve addresses the class-imbalance problem.
Let's see which AUC curve makes more business sense: as the slide's plot shows, the orange curve (with the highest AUC) doesn't satisfy the business constraint and isn't profitable.
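A minimal sketch (assuming scikit-learn is available) of the quantities behind these curves: ROC AUC over all thresholds versus precision and recall at one threshold, on an imbalanced toy set with hypothetical scores.

```python
from sklearn.metrics import roc_auc_score, precision_score, recall_score

y_true   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]          # imbalanced: only 2 positives
y_scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.8, 0.7, 0.9]

print("ROC AUC:", roc_auc_score(y_true, y_scores))  # high despite the imbalance

threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in y_scores]
print("precision:", precision_score(y_true, y_pred))  # hurt by the one confident false positive
print("recall:   ", recall_score(y_true, y_pred))
```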
  • 18. Now that we have a model which is economically viable, let's pick the right classifier threshold.

Model Optimization Strategies:
- Besides cross-validation and hyper-parameter tuning, tie long-term business metrics to the response variable.
- Know that selecting the wrong model, and the penalty of misclassification, has an economic impact (say, money transactions / credit allocation).
- An ML algorithm can inherently hide issues: if features are implemented incorrectly, or from random variations in production data.
- So interpretation and evaluation of the model is very important.

First, interpret models:
> understand how features contribute to model predictions (build confidence)
> explain individual predictions
> evaluate consistency and coherence of the model

ML-Insights (Python package):
> quick feature dependence plots (ICE plots) for model exploration / understanding (a small ICE-style sketch follows below)
> feature effect summary
> explain feature predictions (given 2 points, explain why the model made a better prediction for one point over the other)
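A minimal sketch of the ICE-plot idea mentioned above, without using the ML-Insights package itself: sweep one feature over a grid while holding the rest of a single instance fixed, and record the model's predicted risk. The glucose/age features, the data, and the logistic-regression model are illustrative assumptions (mirroring the glucose-vs-risk excerpt referenced on the next slide).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: columns = [glucose, age], target = risk (0/1).
rng = np.random.default_rng(0)
X = rng.normal(loc=[100, 50], scale=[25, 10], size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 10, 200) > 130).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

instance = np.array([110.0, 60.0])               # one patient; other features held constant
for glucose in range(70, 200, 20):               # sweep only the glucose feature
    row = instance.copy()
    row[0] = glucose
    risk = model.predict_proba(row.reshape(1, -1))[0, 1]
    print(glucose, round(risk, 3))               # one ICE curve: risk as a function of glucose
```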
  • 19. Detailed references:
http://www.slideshare.net/SessionsEvents/brian-lucena-senior-data-scientist-metis-at-mlconf-sf-2016
https://www.youtube.com/watch?v=Vv157uQrgg4&list=PLrbAIdPI69Pi88waiIv8gZ3agEU_hBaVM&index=9

Excerpt: here we see that if the glucose value is fixed to a specified value (keeping the other variables constant), we can observe how the RISK factor changes for different patients.
Code: https://github.com/numeristical/introspective/tree/master/examples

Next, evaluate model performance:
- always include costs in model selection
- always review the model evaluation metrics

Optimization algorithms adopt local methods:
- Stochastic Gradient Descent, Conjugate Gradient
- embarrassingly parallel
- can get stuck in local minima

Mathematical optimization:
- can find the global optimum
- nicely handles constraints (L0 norm)

Examples of mathematical optimization models used in ML algorithms:
>> Linear models: LASSO, Ridge Classifier, Elastic Net, Hinge Loss
>> SVM: primal, dual linear
>> Decision forests: decision tree vote
>> Alternating Least Squares: application to collaborative filtering

Ref: http://www.slideshare.net/SessionsEvents/jeanfranois-puget-distinguished-engineer-machine-learning-and-optimization-ibm-at-mlconf-sf-2016

Interesting documents: Causal Analysis of Observational Data
https://www.youtube.com/watch?v=X2j6QT4UDSs&list=PLrbAIdPI69Pi88waiIv8gZ3agEU_hBaVM&index=12

References :