
Intuiting Predictive Algorithms

Justin Skycak, 2018

The goal of this write-up is to show how various predictive algorithms function and relate to each
other.

1. Naive Bayes

2. MAP and MLE

3. Linear Regression

4. Support Vector Machines

5. Neural Networks

6. Decision Trees

7. Ensemble Methods

1. Naive Bayes

If we know the causal structure between variables in our data, we can build a Bayesian network,
which encodes conditional dependencies between variables via a directed acyclic graph. Such a
model is constrained by our human understanding of the relationship between parts of the data,
though, and may not be optimal when we wish to predict a target variable despite knowing little
about the other variables to which it may or may not relate.

That being said, if we know that the target variable is a class that somehow encapsulates the other variables, it can be worthwhile to try a Bayesian network in which the other variables are assumed to be conditionally independent of each other given the class. This is called Naive Bayes classification because it naively assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

If the data is given by $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, where each $y^{(i)}$ belongs to a class, then the Naive Bayes classifier computes

$$\hat{y} = \arg\max_{c} \; P(c) \prod_{j} P(x_j \mid c)$$

For example, we could build a Naive Bayes classifier to predict whether an email is a phishing attempt based on whether it has spelling errors and links, and then use the model to test whether a new email is a phishing attempt.

In this example, we used discrete bins for the features -- but Naive Bayes can also handle
features which are fit to continuous distributions. And despite assuming that features are
independent (and thus potentially ignoring a lot of useful information), Naive Bayes can
sometimes perform well enough in simple applications to get the job done.
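To make the phishing example concrete, here is a minimal Python sketch. The class priors and feature likelihoods below are hypothetical stand-ins, since the original worked tables are not reproduced here:

```python
# Minimal Naive Bayes sketch for the phishing example.
# All counts/probabilities below are hypothetical illustrations.

def naive_bayes_score(prior, likelihoods, features):
    """P(class) times the product of P(feature_j | class) over observed features."""
    score = prior
    for feature, value in features.items():
        score *= likelihoods[feature][value]
    return score

# Hypothetical training statistics: P(feature value | class).
phishing = {
    "prior": 0.3,
    "likelihoods": {
        "spelling_errors": {True: 0.8, False: 0.2},
        "has_links":       {True: 0.9, False: 0.1},
    },
}
legitimate = {
    "prior": 0.7,
    "likelihoods": {
        "spelling_errors": {True: 0.1, False: 0.9},
        "has_links":       {True: 0.5, False: 0.5},
    },
}

# Classify a new email that has spelling errors and links.
email = {"spelling_errors": True, "has_links": True}
scores = {
    "phishing":   naive_bayes_score(phishing["prior"], phishing["likelihoods"], email),
    "legitimate": naive_bayes_score(legitimate["prior"], legitimate["likelihoods"], email),
}
print(max(scores, key=scores.get))  # -> "phishing" (0.216 vs 0.035)
```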

2. MAP and MLE

Given data $D$, if we model the relationship between the predictors $\mathbf{x}$ and the target $y$ as being governed by parameters $\theta$, then Bayes' rule tells us that

$$P(y \mid \mathbf{x}, D) = \int P(y \mid \mathbf{x}, \theta) \, P(\theta \mid D) \, d\theta \qquad \text{where} \qquad P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta)$$

We can interpret the integral as an average over all models, where the weight of a model's contribution to the average is governed by the term $P(\theta \mid D)$. This term is called the posterior or "a posteriori" distribution, as it is the result of updating the prior or "a priori" distribution $P(\theta)$ (which reflects our previous beliefs about the parameters) with the information that the data tells us.

The average is difficult to compute, since the number of models grows exponentially with the number of parameters. It is easier to just pick the single model that maximizes the posterior distribution, rather than averaging over the entire ensemble:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta) \, P(\theta)$$

This is called Maximum A Posteriori (MAP) estimation.

If we want to model as though we know nothing aside from what the data tells us, then we can use the Jeffreys prior, which assigns $P(\theta)$ as a uniform distribution and is also known as the "uninformative" or "improper" prior since it does not actually depend on $\theta$. When we perform MAP estimation using the Jeffreys prior, we are doing what is known as Maximum Likelihood Estimation (MLE). MLE derives its name from the fact that MAP with the Jeffreys prior amounts to maximizing $P(D \mid \theta)$, which is known as the likelihood.

To visualize the relationship between the MAP and MLE estimates, one can imagine starting at the MLE estimate, and then obtaining the MAP estimate by drifting a bit towards higher density in the prior distribution.
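As a quick illustration (a hypothetical coin-flip example, not from the original text), suppose we estimate a coin's heads probability $\theta$ from 7 heads in 10 flips, using a Beta prior centered on a fair coin:

```python
# MLE vs. MAP for a coin's heads probability (hypothetical numbers).
heads, flips = 7, 10

# MLE: maximize the likelihood P(D | theta) alone.
theta_mle = heads / flips                          # 0.7

# MAP: maximize P(D | theta) * P(theta) with a Beta(5, 5) prior.
# The posterior is Beta(heads + 5, tails + 5); its mode is the MAP estimate.
a, b = 5, 5
theta_map = (heads + a - 1) / (flips + a + b - 2)  # 11/18, about 0.61

print(theta_mle, theta_map)
# The MAP estimate sits between the MLE (0.7) and the prior's mode (0.5),
# "drifting" toward the prior's high-density region, as described above.
```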

3. Linear Regression

In linear regression, we model the target as a random variable whose expected value depends on a linear combination of the predictors (including a bias term, i.e. a column of 1s). When the noise is assumed to be Gaussian, MLE simplifies to least squares:

$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \sum_i \left( y^{(i)} - \boldsymbol\beta^{\mathsf T} \mathbf{x}^{(i)} \right)^2$$

In multivariate linear regression, each $\mathbf{y}^{(i)}$ is a vector containing multiple targets. If the covariance matrix of the targets is a multiple of the identity matrix, then Gaussian MLE again simplifies to least squares. Provided the targets are linearly related, we can cause the covariance matrix to become a multiple of the identity matrix by converting the targets to an orthonormal basis of principal components (this is known as PCA, or principal component analysis).

One benefit of linear regression over more complex models is that linear regression is very
interpretable. Provided the predictors are normalized and are not linearly dependent, the
parameter or coefficient for a particular term can be interpreted as its “weight” in determining the
prediction. Even if the predictors are linearly dependent, we can still make the model
interpretable if we replace the predictors with a subset of their principal components before
performing the regression. This is called Principal Component Regression.
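A minimal numpy sketch of least-squares fitting with a bias column, where the synthetic data is assumed purely for illustration:

```python
import numpy as np

# Synthetic data: y = 2*x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

# Design matrix with a bias term, i.e. a column of 1s.
X = np.column_stack([np.ones_like(x), x])

# Least squares: beta = (X^T X)^{-1} X^T y, solved stably with lstsq.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1.0, 2.0] -> intercept and slope
```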

Some other types of linear regression include polynomial, logistic, and regularized regression. In polynomial regression, we include not just $x$, but also $x^2$, $x^3$, etc. as predictors. In logistic regression, where the target is binary, we model the target as a Bernoulli random variable where the log of the odds ratio of the success probability is given by a linear regression:

$$\log \frac{p}{1-p} = \boldsymbol\beta^{\mathsf T} \mathbf{x}$$

In regularized regression, we assume a prior other than the Jeffreys prior. A Gaussian prior gives rise to L2 regularization ("ridge" regression):

$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \left[ \sum_i \left( y^{(i)} - \boldsymbol\beta^{\mathsf T} \mathbf{x}^{(i)} \right)^2 + \lambda \| \boldsymbol\beta \|_2^2 \right]$$

A Laplacian prior gives rise to L1 regularization ("lasso" regression):

$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \left[ \sum_i \left( y^{(i)} - \boldsymbol\beta^{\mathsf T} \mathbf{x}^{(i)} \right)^2 + \lambda \| \boldsymbol\beta \|_1 \right]$$
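To see how the L2 penalty behaves, here is a hedged sketch of closed-form ridge regression in numpy (the data and $\lambda$ values are assumed for illustration; the lasso is omitted since L1 has no closed form and is usually solved iteratively):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: beta = (X^T X + lam*I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data with nearly collinear predictors.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=200)

# Larger lambda shrinks the coefficients toward zero (the Gaussian prior's mean)
# and stabilizes the estimates despite the collinearity.
for lam in [0.0, 1.0, 100.0]:
    print(lam, ridge_fit(X, y, lam))
```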

4. Support Vector Machines

In logistic regression, we maximize the likelihood of assigning the correct target class probability
to a group of predictors. However, if our ultimate goal is to choose the most likely class for the
group of predictors, we care less about getting the probability exactly right when the choice of class is already determined (i.e. the probability is already fairly high or low), and more about choosing
the correct class when the probability is borderline. In this case, we should focus on finding the
best separation between the classes.

A Support Vector Machine (SVM) computes the "best" separation between classes as the maximum-margin hyperplane, i.e. the hyperplane which maximizes the distance from the border to the closest points (which are called the support vectors). For a hyperplane of the form $\mathbf{w} \cdot \mathbf{x} - b = 0$, we can assume $\min_i |\mathbf{w} \cdot \mathbf{x}^{(i)} - b| = 1$ because dividing the equation by a constant yields the same plane, and thus the parameters which yield the maximum margin are given by

$$\min_{\mathbf{w}, b} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y^{(i)} \left( \mathbf{w} \cdot \mathbf{x}^{(i)} - b \right) \ge 1 \text{ for all } i$$

Through methods in constrained optimization, this "primal" form of the hyperplane can be reparametrized in "dual" form by the parameters $\alpha_1, \ldots, \alpha_n$, one for each point in the data, which are chosen as

$$\max_{\boldsymbol\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \left( \mathbf{x}^{(i)} \cdot \mathbf{x}^{(j)} \right) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y^{(i)} = 0$$

and for which the hyperplane is stated as

$$\mathbf{w} = \sum_i \alpha_i y^{(i)} \mathbf{x}^{(i)}$$

where the sum is taken over the support vectors (the points with $\alpha_i \neq 0$).

If the data is not linearly separable, then we can use a function $\varphi$ to map the data into a higher-dimensional "feature" space before fitting the hyperplane. Below is an example of a function which maps 2-dimensional data into 3-dimensional space:

$$\varphi(x_1, x_2) = \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right)$$

The hyperplane, once projected back to the lower-dimensional input space, is able to fit nonlinearities in the data.

Normally, we would worry about blowing up the number of dimensions in the model, which would cause computational and memory problems while training (fitting) the SVM. However, if we choose a function for $\varphi$ (like the one above) that can be represented by a "kernel," we can compute the result of dot products in the higher-dimensional space without having to compute and store the values of the data in the higher-dimensional space:

$$k(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x}) \cdot \varphi(\mathbf{y})$$

This is called the "kernel trick." Some common kernels include the homogeneous polynomial kernel $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^d$, the inhomogeneous polynomial kernel $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^d$, the Gaussian radial basis function (RBF) kernel $k(\mathbf{x}, \mathbf{y}) = \exp\left( -\gamma \|\mathbf{x} - \mathbf{y}\|^2 \right)$, and the hyperbolic tangent kernel $k(\mathbf{x}, \mathbf{y}) = \tanh\left( \kappa \, \mathbf{x} \cdot \mathbf{y} + c \right)$.
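We can verify the kernel trick numerically for the map above: the dot product computed explicitly in the 3-dimensional feature space matches the homogeneous polynomial kernel with $d = 2$ evaluated directly in the 2-dimensional input space. A minimal sketch:

```python
import numpy as np

def phi(v):
    """Map 2D input space to 3D feature space (the example above)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

explicit = phi(x) @ phi(y)  # dot product computed in feature space
kernel = (x @ y) ** 2       # homogeneous polynomial kernel, d = 2
print(explicit, kernel)     # both equal 16.0
```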

Another way to extend the SVM to data which is not linearly separable is to use a soft margin, where we minimize a loss function which penalizes data on the wrong side of the hyperplane. We introduce a "hinge loss" function which is zero for data on the correct side of the margin and is proportional to distance for data on the wrong side of the hyperplane, and minimize the total hinge loss:

$$\min_{\mathbf{w}, b} \sum_i \max\left( 0, \; 1 - y^{(i)} \left( \mathbf{w} \cdot \mathbf{x}^{(i)} - b \right) \right)$$

The hinge loss function is named as such because its graph looks like a door hinge. When the data is linearly separable, the total hinge loss is minimized by our previous "hard margin" method. In practice, it is difficult to deal with the constraint $\min_i |\mathbf{w} \cdot \mathbf{x}^{(i)} - b| = 1$, but we still want to prevent the weights from blowing up, so we replace the constraint with a regularization term:

$$\min_{\mathbf{w}, b} \left[ \sum_i \max\left( 0, \; 1 - y^{(i)} \left( \mathbf{w} \cdot \mathbf{x}^{(i)} - b \right) \right) + \lambda \|\mathbf{w}\|^2 \right]$$

A similar dual form exists for this minimization problem, to which the kernel trick is also applicable.
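For concreteness, here is a minimal sketch of training a linear soft-margin SVM by subgradient descent on the regularized hinge loss. The learning rate, regularization strength, epoch count, and toy data are all illustrative assumptions:

```python
import numpy as np

def train_soft_margin(X, y, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on total hinge loss + lam * ||w||^2; y entries are +1/-1."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        violated = margins < 1  # points inside the margin or misclassified
        # Subgradient of max(0, 1 - y*(w.x - b)) is -y*x (w.r.t. w) and +y (w.r.t. b)
        # for violated points, plus the gradient of the regularization term.
        grad_w = -(y[violated][:, None] * X[violated]).sum(axis=0) + 2 * lam * w
        grad_b = y[violated].sum()
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Toy linearly separable data.
X = np.array([[1., 1.], [2., 2.], [-1., -1.], [-2., -2.]])
y = np.array([1, 1, -1, -1])
w, b = train_soft_margin(X, y)
print(np.sign(X @ w - b))  # should match y
```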

SVMs can be extended to multiclass problems by building a model for each pair of classes and
then selecting the class which receives the most votes from the ensemble of models. This is
called one-vs-one (OvO) classification.

Another option is to force the SVM to give a probabilistic score rather than a binary output, build a model for each class against the rest of the data, and then select the class whose model gives the highest score. This is called one-vs-rest (OvR) classification. To induce a probabilistic score, one can interpret the distance from the hyperplane as the logit (log of the odds ratio) of the in-class probability:

$$P(\text{class} \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} - b)}}$$

SVMs can also be used for regression, in which case they are called Support Vector Regressors (SVRs). In SVRs, the support vectors are the points furthest from the hyperplane, and the task is to minimize the distance to the support vectors.

5. Neural Networks

Neural Networks (NNs) consist of layers of "neurons," where each neuron has an "activity" which is computed as a function of the weighted sum of activities of neurons in the previous layer. The first layer of neurons is activated directly by the data, and the activation of a particular neuron in the last layer represents the likelihood of the data belonging to a particular class. NNs are similar to SVMs in that they project the data to a higher-dimensional space and
fit a hyperplane to the data in the projected space. However, whereas SVMs use a
predetermined kernel to project the data, NNs automatically construct their own projection by
iteratively adjusting ("training") the weights in the intermediate ("hidden") layers to minimize a
loss function. Unlike with the kernel trick in SVMs, the training of additional layers in NNs incurs
significant additional computational cost, and much work has been devoted to optimizing
algorithms and hardware usage to speed up the training of Deep Neural Networks (DNNs)
consisting of many hidden layers.

Each layer of a NN consists of a parameter matrix $W^{(\ell)}$, where the $i$th row vector contains the weights received by the $i$th neuron in the next layer. If we define $f^{(\ell)}$ as the activation function which is applied component-wise (i.e. neuron-wise) at the $\ell$th layer, and include a bias term as a neuron in each layer whose activity is always 1, then the output activity of the network $n$ layers after the first layer is given by

$$a^{(n)} = f^{(n)}\!\left( W^{(n)} f^{(n-1)}\!\left( W^{(n-1)} \cdots f^{(1)}\!\left( W^{(1)} \mathbf{x} \right) \cdots \right) \right)$$

We can write this recursively as

$$a^{(0)} = \mathbf{x}, \qquad a^{(\ell)} = f^{(\ell)}\!\left( W^{(\ell)} a^{(\ell-1)} \right)$$

When counting layers, we do not count the first layer, because it reads in the data and is therefore not associated with trainable weight parameters. This way, each layer is associated with a weight matrix.

For a regression network with $n$ layers, a loss function $E\!\left( a^{(n)}, y \right)$ is chosen to compare the output $a^{(n)}$ to the desired target $y$ from the data. Common choices for this loss function include L1 and L2 error. For a classification network with $n$ layers, we normalize the output so that it sums to 1 and can be interpreted as a probability distribution $\hat{y}$ over the classes. Then, a loss function is chosen to measure the discrepancy between the output $\hat{y}$ and the desired target ("ground truth") $y$ from the data. For classification, the loss function is usually chosen as the cross-entropy. For each data point, the cross-entropy is given by

$$E = -\sum_k y_k \log \hat{y}_k$$

To make sense of the cross-entropy, notice that if the ground truth is a single class $c$, i.e. $y_c = 1$ and $y_k = 0$ for $k \neq c$, then it simplifies to

$$E = -\log \hat{y}_c$$

which is large when the predicted probability of the true class is small.
The training algorithm, called gradient descent, consists of iteratively updating the weights at each layer according to

$$W^{(\ell)} \leftarrow W^{(\ell)} - \eta \, \frac{\partial E}{\partial W^{(\ell)}}$$

where $\eta$ is the learning rate, which governs how quickly the weights change. There are other variations of gradient descent, such as stochastic gradient descent (SGD), where the gradient is estimated on randomly-chosen subsets of the data, introducing noise that can help the weights break out of shallow minima while still allowing them to settle into deeper minima, and SGD with momentum, which is meant to mimic the trajectory of a ball rolling down a bumpy hill into a valley. The main task in all of these methods, though, is computing the derivative ("gradient") of the loss function with respect to the weights. Luckily, there is a pattern to it, which we will see by computing the gradient layer by layer, starting from the last layer.

Computing $\partial E / \partial W^{(n)}$ for the last layer, we have

$$\frac{\partial E}{\partial W^{(n)}} = \left[ \frac{\partial E}{\partial a^{(n)}} \odot f^{(n)\prime}\!\left( W^{(n)} a^{(n-1)} \right) \right] \left( a^{(n-1)} \right)^{\mathsf T}$$

where the operation $\odot$ represents the Hadamard (component-wise) product. We define

$$\delta^{(n)} = \frac{\partial E}{\partial a^{(n)}} \odot f^{(n)\prime}\!\left( W^{(n)} a^{(n-1)} \right)$$

so that

$$\frac{\partial E}{\partial W^{(n)}} = \delta^{(n)} \left( a^{(n-1)} \right)^{\mathsf T}$$

Computing $\partial E / \partial W^{(n-1)}$ for the next layer back, we have

$$\frac{\partial E}{\partial W^{(n-1)}} = \left[ \left( \left( W^{(n)} \right)^{\mathsf T} \delta^{(n)} \right) \odot f^{(n-1)\prime}\!\left( W^{(n-1)} a^{(n-2)} \right) \right] \left( a^{(n-2)} \right)^{\mathsf T}$$

We define

$$\delta^{(n-1)} = \left( \left( W^{(n)} \right)^{\mathsf T} \delta^{(n)} \right) \odot f^{(n-1)\prime}\!\left( W^{(n-1)} a^{(n-2)} \right)$$

so that

$$\frac{\partial E}{\partial W^{(n-1)}} = \delta^{(n-1)} \left( a^{(n-2)} \right)^{\mathsf T}$$

Computing $\partial E / \partial W^{(n-2)}$ for the layer before that, we have

$$\frac{\partial E}{\partial W^{(n-2)}} = \left[ \left( \left( W^{(n-1)} \right)^{\mathsf T} \delta^{(n-1)} \right) \odot f^{(n-2)\prime}\!\left( W^{(n-2)} a^{(n-3)} \right) \right] \left( a^{(n-3)} \right)^{\mathsf T}$$

We define

$$\delta^{(n-2)} = \left( \left( W^{(n-1)} \right)^{\mathsf T} \delta^{(n-1)} \right) \odot f^{(n-2)\prime}\!\left( W^{(n-2)} a^{(n-3)} \right)$$

so that

$$\frac{\partial E}{\partial W^{(n-2)}} = \delta^{(n-2)} \left( a^{(n-3)} \right)^{\mathsf T}$$

Putting it all together, we have that

$$\frac{\partial E}{\partial W^{(\ell)}} = \delta^{(\ell)} \left( a^{(\ell-1)} \right)^{\mathsf T}$$

where

$$\delta^{(n)} = \frac{\partial E}{\partial a^{(n)}} \odot f^{(n)\prime}\!\left( W^{(n)} a^{(n-1)} \right), \qquad \delta^{(\ell)} = \left( \left( W^{(\ell+1)} \right)^{\mathsf T} \delta^{(\ell+1)} \right) \odot f^{(\ell)\prime}\!\left( W^{(\ell)} a^{(\ell-1)} \right)$$
This method for computing the gradients is called "backpropagation," because we propagate the $\delta^{(\ell)}$ terms backwards through the layers, from the last layer to the first layer.
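To make the equations concrete, here is a minimal numpy sketch of a two-layer classification network trained with backpropagation. The architecture, toy data, and hyperparameters are illustrative assumptions, and the output-layer delta uses the standard softmax-plus-cross-entropy shortcut rather than the generic formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 8 hidden (ReLU) -> 2 classes (softmax).
# Bias terms are included as constant-1 neurons, as in the text.
W1 = rng.normal(0, 0.5, (8, 3))  # weights into the hidden layer (2 inputs + bias)
W2 = rng.normal(0, 0.5, (2, 9))  # weights into the output layer (8 hidden + bias)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(x, y_onehot, lr=0.1):
    global W1, W2
    # Forward pass: a_l = f(W_l a_{l-1}), with a bias neuron appended to each layer.
    a0 = np.append(x, 1.0)
    z1 = W1 @ a0
    a1 = np.append(np.maximum(0, z1), 1.0)       # ReLU hidden activities
    a2 = softmax(W2 @ a1)                        # output class probabilities

    # Backward pass: propagate delta terms from the last layer to the first.
    delta2 = a2 - y_onehot                       # softmax + cross-entropy shortcut
    delta1 = (W2[:, :-1].T @ delta2) * (z1 > 0)  # Hadamard with the ReLU derivative

    # Gradient of E w.r.t. each weight matrix: delta_l (a_{l-1})^T.
    W2 -= lr * np.outer(delta2, a1)
    W1 -= lr * np.outer(delta1, x if False else a0)

# Toy XOR-style data: class 1 iff exactly one input is on.
data = [([0., 0.], 0), ([0., 1.], 1), ([1., 0.], 1), ([1., 1.], 0)]
for _ in range(2000):
    for x, label in data:
        train_step(np.array(x), np.eye(2)[label])
```

With enough iterations the network typically learns the XOR-style labels, though results vary with the random seed, since nothing prevents hidden neurons from dying (see the discussion of ReLUs below).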

The equations for backpropagation also give us insight into our choice of activation function. If we choose a sigmoidal activation function which levels off, then the gradient will vanish for neurons whose weighted-sum inputs are too large in magnitude. But if we choose a linear activation function which maintains a slope of 1 everywhere, then we have nothing more than a linear model, and the network is unable to project the data into a higher-dimensional space before fitting the hyperplane. The solution is to use a "rectified" linear unit (ReLU), which is linear for positive inputs and zero for negative inputs:

$$f(z) = \max(0, z)$$

Ideally, we'd use a "softplus" function, which, unlike the ReLU, is differentiable at zero -- but ReLUs are so much faster to compute that we use them anyway. We can usually get away with the slope being zero for negative inputs to the ReLU because the weighted sums in the network tend to be positive at least some of the time. However, if we set the learning rate too high, we can sometimes end up with neurons whose weighted sums are always negative, and consequently whose gradients and activity are always zero. To overcome this problem of "dead" neurons, one can use leaky ReLUs, which have a small gradient and activity even for negative inputs:

$$f(z) = \begin{cases} z & z > 0 \\ 0.01\,z & z \le 0 \end{cases}$$

That being said, ReLUs may not be the best choice for the output layer of the network, which is
supposed to represent a regression or classification prediction. For regressions, linear activation
functions are a better choice in the final layer, and for classifications, softmax units are a better
choice in the final layer.

Due to the large number of parameters in NNs, they are prone to overfitting. However, the risk of overfitting can be reduced by "dropout," a method used to avoid training all of the weights on all of the training data. Dropout involves randomly turning off or "dropping out" neurons from the network during each training iteration, and then keeping the weights of those neurons unchanged during the weight update. Dropout also increases training speed, since dropping out half the neurons in a network cuts the number of computations in half.
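One common way to implement this is "inverted dropout" (an assumption here; frameworks differ in the details), which rescales the surviving activities so their expected value matches test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activities, p_drop=0.5, training=True):
    """Randomly zero out neurons during training; rescale to preserve expectations."""
    if not training:
        return activities  # all neurons participate at test time
    mask = rng.random(activities.shape) >= p_drop
    return activities * mask / (1 - p_drop)

a = np.ones(10)
print(dropout(a))  # roughly half the entries zeroed, the survivors scaled by 2
```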

One type of neural network that has seen widespread success in the realm of image processing is the convolutional neural network (CNN), which reduces the number of parameters (thus enabling deep networks of many layers) by taking advantage of spatially local input patterns. In CNNs, each layer of neurons is really a stack of sub-layers, and each neuron in a sub-layer is connected to only a small region ("receptive field") of a single sub-layer in the preceding layer. Receptive field weights are shared across neurons within a sub-layer, thus forming a template ("convolution") that can be interpreted as the pattern of activation that the sub-layer is trained to detect within the sub-layer in the preceding layer. A sub-layer's convolution can be expressed
as a weighted sum of different offsets of the convolution in the sub-layer in the preceding layer,
and by carrying the weighted sum through all the layers down to the input layer, one can see the
visual feature that the sub-layer is trained to detect within the image. Visual features of
sub-layers within lower layers are usually simple, like lines and edges, whereas visual features
of sub-layers within higher layers can be complex, like faces or cars.

6. Decision Trees

One drawback of SVMs and NNs is that they are black-box models, meaning they are
uninterpretable. Although they can model highly nonlinear data, we can’t make much sense of
what the model has learned by looking at the parameters. On the other hand, the parameters in
linear regressions make intuitive sense as the contributions of individual factors to the overall
decision, but they are restricted by linearity and thus won’t make a good predictive model for
highly nonlinear data. Decision trees bridge the gap and are able to model nonlinear data while
remaining interpretable.

A decision tree constructs a model by recursively partitioning (“splitting”) class data along some
value of a predictor (“attribute”) until each partition represents a single class of data. The tree
starts with a single node which represents all of the data, and then splits into two child nodes to
separate the data into two groups which are as homogeneous (“pure”) as possible. Then, each
child node performs the same splitting process to produce two more child nodes of maximum
purity, and so on, until each terminal node (“leaf”) of the tree is 100% pure or the data cannot be
split any more (sometimes otherwise identical records may have different classes). The
predicted probability distribution for the class of any input is computed as the frequency
distribution of classes within the input’s corresponding leaf.

The metric which is used to quantify the purity of a split is called the splitting criterion, and it is often chosen as information gain or Gini impurity. Information gain measures the reduction in impurity ("information entropy") achieved by a split. Information entropy for a node is measured by the expectation value

$$H = -\sum_k p_k \log_2 p_k$$

over the classes $k$ within the node, where $p_k$ is the proportion of the node's data points that belong to class $k$. Information entropy is largest for uniform distributions, and zero for distributions which are concentrated at a single point. Information gain
is the entropy of the parent node, minus the weighted average entropy of the child nodes (each child weighted by its proportion of data points from the parent node). Gini impurity is very similar to information entropy, just a little faster to compute. It is given by

$$G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2$$
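Both impurity measures take only a few lines of numpy; the class proportions below are illustrative:

```python
import numpy as np

def entropy(p):
    """Information entropy -sum p_k log2 p_k, with 0 log 0 taken as 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(p):
    """Gini impurity 1 - sum p_k^2."""
    return 1 - (p ** 2).sum()

pure = np.array([1.0, 0.0])
uniform = np.array([0.5, 0.5])
print(entropy(pure), entropy(uniform))  # 0 (pure) and 1 (uniform)
print(gini(pure), gini(uniform))        # 0 (pure) and 0.5 (uniform)

# Information gain of a split: parent entropy minus the
# weighted average entropy of the child nodes.
parent = np.array([0.5, 0.5])
left, right = np.array([0.9, 0.1]), np.array([0.1, 0.9])
gain = entropy(parent) - 0.5 * entropy(left) - 0.5 * entropy(right)
print(gain)  # positive: the split increases purity
```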

To prevent the tree from overfitting the data, which it will almost certainly do if left to construct an unlimited number of partitions, the tree is "pruned." Pruning can be achieved by stopping the tree prior to full growth, in which case it is called pre-pruning, or by cutting the tree short after full growth, in which case it is called post-pruning. Pre-pruning can be achieved by declining to split a node if the purity gain of the split is below some threshold value -- though any choice of such a value is rather ad hoc. With post-pruning, it is possible to take a more principled approach, using cross-validation to check the effect of pruning on the tree's test accuracy.

7. Ensemble Methods

One big disadvantage of decision trees is that they have a high variance (i.e. they are unstable,
not robust to noise in the data). A slight change in the data can cause a different split to occur,
giving rise to different child nodes and splits all the way down the tree, potentially leading to
different predictions. However, we can make the predictions more stable by averaging them
across an ensemble of many different decision trees, called a random forest. Random forests
grow a variety of decision trees by forcing each split attribute to be selected from a random
subset of candidates. They also train each tree using a random subset of the available training
data, which is known as bootstrap aggregating or “bagging” for short.

In general, bagging constructs an ensemble of models which reduces model variance, making it
suitable for complex models (low bias, high variance). For simple models (high bias, low
variance), another ensemble model called gradient boosting can be used to reduce model bias.
Gradient boosting performs gradient descent on a cost function by building the ensemble as a
sequence of “error-correcting” models, where each model is trained on a subset of the training
data that emphasizes instances that were misclassified by the preceding model. The output of
the ensemble is a weighted average, where the weight given to a model’s prediction depends
on the model’s accuracy.
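A short sketch contrasting the two regimes, using scikit-learn with largely default settings (the dataset and hyperparameters are assumed for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging of deep trees (low bias, high variance): the forest averages away variance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting of shallow trees (high bias, low variance): the sequence reduces bias.
boosted = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

for name, model in [("random forest", forest), ("gradient boosting", boosted)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```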

Another example of an ensemble that we encountered earlier was the Bayes optimal ensemble, which averages over all models within a given parametric family. Since the average is difficult to compute, we settled for choosing the single model which contributed most to the ensemble (MAP/MLE). However, there are ways to approximate the average through sampling, known as Bayesian Model Combination (BMC).

The type of ensemble model that wins most data science competitions, perhaps surprisingly, is
not the Bayes optimal ensemble. Rather, it is the stacked model, which consists of an ensemble
of entirely different species of models together with some combiner algorithm (usually chosen
as a logistic regression) which is trained to make a final prediction using the predictions of the
models within the ensemble as additional inputs. Although the Bayes optimal ensemble
performs at least as well as (and often better than) stacking when the correct data-generating
model is on the list of models under consideration, the correct data-generating model for data
difficult enough to warrant a competition is often too complex to be approximated by models in
the Bayes optimal ensemble. In such cases, it is advantageous to have a diverse basis of
models from which to approximate the data-generating model.
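A stacked model is straightforward to sketch with scikit-learn's StackingClassifier; the particular base models and dataset below are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Entirely different species of models...
base_models = [
    ("naive_bayes", GaussianNB()),
    ("svm", SVC(probability=True, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# ...combined by a logistic regression trained on their predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
print(cross_val_score(stack, X, y, cv=5).mean())
```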
