[email protected]
11/11/21
¡ An algorithm is a procedure or set of steps or rules to
accomplish a task. Algorithms are one of the fundamental
concepts in, or building blocks of, computer science.
¡ Some of the basic types of tasks that algorithms can solve are
§ sorting, searching, and graph-based computational problems
¡ In data science, there are at least three classes of algorithms
one should be aware of:
§ Data munging, preparation, and processing algorithms, such as
sorting, MapReduce, or Pregel.
§ Optimization algorithms for parameter estimation, including
Stochastic Gradient Descent, Newton’s Method, and Least Squares.
§ Machine learning algorithms.
¡ Machine learning algorithms are largely used to
predict, classify, or cluster.
¡ Machine learning algorithms are the basis of artificial
intelligence (AI) such as image recognition, speech
recognition, recommendation systems, ranking and
personalization of content.
¡ Machine learning algorithms are described as
learning a target function (f) that best maps input
variables (X) to an output variable (Y): Y = f(X)
¡ Linear Regression
¡ Logistic Regression
¡ Linear Discriminant Analysis
¡ K-Means
¡ Classification and Regression Trees
¡ Naive Bayes
¡ K-Nearest Neighbors
¡ Learning Vector Quantization
¡ Support Vector Machines
¡ Bagging and Random Forest
¡ Boosting and AdaBoost
¡ PCA
¡ Linear regression is one of the fundamental supervised machine-learning
algorithms, owing to its relative simplicity and well-known properties.
¡ It is one of the most well-known and well-understood algorithms in
statistics and ML.
¡ It expresses the mathematical relationship between two variables or
attributes.
¡ Predictive modeling is primarily concerned with minimizing the
error of a model, i.e., making the most accurate predictions, at the
expense of explainability.
¡ Linear regression may be simple or multiple (multivariate).
§ The case of one explanatory variable is called simple linear
regression.
§ With more than one explanatory variable, the process is
called multiple linear regression.
¡ Assumption
§ There is a linear relationship between an outcome variable
(dependent variable) and a predictor (independent variable
or feature).
¡ Linear regression models the relationship between the independent and
dependent variables by fitting a best-fit line: the coefficients a and
b are derived from the given input by minimizing the sum of squared
distances between the data points and the regression line.
¡ The slope b and intercept a of the fitted line $y = a + bx$ are

$$b = r\,\frac{s_y}{s_x} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} \qquad a = \bar{y} - b\,\bar{x}$$

$$s_x = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} \qquad s_y = \sqrt{\frac{\sum (y-\bar{y})^2}{n-1}} \qquad r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \, \sum (y-\bar{y})^2}}$$

Where: r – Pearson correlation coefficient, $s_x$, $s_y$ – standard deviations,
$\bar{x}$ – x mean and $\bar{y}$ – y mean
No   Age (X)   Glucose level (Y)   XY      X²      Y²
1    43        99                  4257    1849    9801
2    21        65                  1365    441     4225
3    25        79                  1975    625     6241
4    42        75                  3150    1764    5625
5    57        87                  4959    3249    7569
6    59        81                  4779    3481    6561
Σ    247       486                 20485   11409   40022

$$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{n\sum X^2 - (\sum X)^2} = \frac{486 \cdot 11409 - 247 \cdot 20485}{6 \cdot 11409 - (247)^2} = 65.1416$$

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{6 \cdot 20485 - 247 \cdot 486}{6 \cdot 11409 - (247)^2} = 0.38522$$

y = 65.1416 + 0.38522·x
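The arithmetic above can be checked with a few lines of Python; the data and the closed-form formulas are taken directly from the table and equations on this slide.

```python
# Simple linear regression coefficients from the summary-statistics formulas:
#   b = (n*ΣXY - ΣX*ΣY) / (n*ΣX² - (ΣX)²)
#   a = (ΣY*ΣX² - ΣX*ΣXY) / (n*ΣX² - (ΣX)²)
ages = [43, 21, 25, 42, 57, 59]       # X: age
glucose = [99, 65, 79, 75, 87, 81]    # Y: glucose level

n = len(ages)
sx, sy = sum(ages), sum(glucose)
sxy = sum(x * y for x, y in zip(ages, glucose))
sxx = sum(x * x for x in ages)

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)    # slope
a = (sy * sxx - sx * sxy) / (n * sxx - sx * sx)  # intercept
print(round(a, 4), round(b, 5))  # → 65.1416 0.38522
```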
§ Exercise: Fit a linear regression line to the following data.
The data is from 51 different states of the USA. The variables are
y = year 2002 birth rate per 1000 females 15 to 17 years old and
x = poverty rate, which is the percent of the state's population
living in households with incomes below the federally defined
poverty level. (Data source: Mind On Statistics, 3rd edition,
Utts and Heckard).
¡ Clustering is the process of partitioning a group of data points
into a small number of clusters.
¡ K-means clustering is a type of unsupervised learning, which
is used when you have unlabeled data.
§ The goal of this algorithm is to find groups in the data, with the number
of groups represented by the variable K.
¡ The algorithm works iteratively to assign each data point to
one of K groups based on the features provided.
¡ Data points are clustered based on feature similarity. The
results of the K-means clustering algorithm are:
§ The centroids of the K clusters, which can be used to label
new data
§ An assignment of each data point to a single cluster.
¡ Behavioral segmentation
§ Segment by purchase history, activities on application, website,
or platform
§ Define personas based on interests
¡ Inventory categorization
§ Group inventory by sales activity and manufacturing metrics
¡ Sorting sensor measurements
§ Detect activity types in motion sensors
§ Group images, Separate audio and Identify groups in health
monitoring
¡ Detecting bots or anomalies
§ Separate valid activity groups from bots
§ Group valid activity to clean up outlier detection
¡ The K-means clustering algorithm uses iterative refinement to produce the
final clusters. The algorithm's inputs are the number of clusters K and the data set.
§ The data set is a collection of features for each data point. The algorithm starts
with initial estimates of the K centroids, which can either be randomly generated
or randomly selected from the data set.
¡ The algorithm then iterates between the following steps:
§ Initially, randomly pick K centroids (points that will be the centers of your
clusters) in d-dimensional space. Try to make them near the data but different
from one another.
§ Assign each data point to the closest centroid.
§ Move each centroid to the average location of the data points assigned to it.
§ Repeat the preceding two steps until the assignments don't change, or change
very little (i.e., no data points change clusters, the sum of the distances is
minimized, or some maximum number of iterations is reached).
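The steps above can be sketched as a minimal K-means implementation. This is an illustrative sketch, not a production implementation: the initial centroids are fixed rather than random so the run is reproducible, and the eight 2-D sample points are made up for illustration.

```python
# Minimal K-means: alternate between an assignment step and an update step
# until the centroids stop moving.
def kmeans(points, centroids, max_iter=100):
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:  # keep an empty cluster's centroid where it is
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (1, 0), (0, 2), (2, 4), (3, 4), (1, 2), (2, 3), (1, 0)]
labels, centroids = kmeans(points, centroids=[(1, 1), (2, 4)])
print(labels)  # → [0, 0, 0, 1, 1, 0, 1, 0]
```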
¡ The algorithm finds the clusters and data-set labels for a
particular pre-chosen K. To find the best number of clusters, the
user needs to run the K-means clustering algorithm for a range
of K values and compare the results.
§ There is no method for determining the exact value of K, but an accurate
estimate can be obtained using the following techniques.
Given n data points $x_i$, i = 1…n, to be partitioned into k clusters,
K-means seeks the assignment that minimizes the within-cluster sum of
squared distances to the cluster centroids.
¡ Given the following two tables, cluster the data using the
k-means algorithm

No   x     y    Cluster
1    185   72
2    170   56
3    169   60
4    179   68
5    182   72
6    188   77

No   x   y   Cluster
A    1   1
B    1   0
C    0   2
D    2   4
E    3   4
F    1   2
G    2   3
H    1   0
¡ KNN can be used for both classification and regression
predictive problems.
¡ K-nearest neighbors is a simple algorithm that stores all
available cases and classifies new cases by a majority
(similarity) vote of their nearest neighbors.
¡ KNN has been used in statistical estimation and pattern
recognition. Three important aspects of KNN:
§ Ease of interpreting the output
§ Calculation time, and
§ Predictive power
¡ Distance measures are only valid for continuous
variables.
¡ Given the following table, classify the data using the k-NN algorithm

No   Durability   Strength   Class    Distance
A    7            7          Weak
B    7            4          Weak
C    3            4          Strong
D    3            4          Strong
E    1            3          Strong
F    5            5          ?????
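A minimal sketch of the majority-vote classification for the unlabeled sample F = (5, 5), assuming Euclidean distance and k = 3 (neither is fixed by the slide):

```python
# k-NN classification: rank the labeled points by distance to the query,
# then take a majority vote among the k nearest.
from collections import Counter

train = [((7, 7), "Weak"), ((7, 4), "Weak"), ((3, 4), "Strong"),
         ((3, 4), "Strong"), ((1, 3), "Strong")]

def knn_classify(query, train, k=3):
    # Squared Euclidean distance preserves the ranking, so sqrt is unnecessary.
    ranked = sorted(train, key=lambda t: (t[0][0] - query[0]) ** 2
                                       + (t[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((5, 5), train))  # → Strong  (neighbors B, C, D: Weak, Strong, Strong)
```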
¡ Given the following, predict the class for
§ Weight 60 with Height 180

Weight   Height   Class         Distance
51       167      Underweight
62       182      Normal
69       176      Normal
64       160      Overweight
65       172      Normal
56       174      Underweight
68       158      Overweight
57       173      Normal
58       169      Normal
68       158      Overweight
55       170      Normal
58       184      Underweight
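The prediction can be checked with the same majority-vote rule, assuming Euclidean distance and k = 3 (the slide does not fix k; a different k can change the answer):

```python
# k-NN prediction for (weight=60, height=180) over the table above, k = 3.
from collections import Counter

rows = [((51, 167), "Underweight"), ((62, 182), "Normal"),
        ((69, 176), "Normal"),      ((64, 160), "Overweight"),
        ((65, 172), "Normal"),      ((56, 174), "Underweight"),
        ((68, 158), "Overweight"),  ((57, 173), "Normal"),
        ((58, 169), "Normal"),      ((68, 158), "Overweight"),
        ((55, 170), "Normal"),      ((58, 184), "Underweight")]

query = (60, 180)
# Rank rows by squared Euclidean distance to the query.
ranked = sorted(rows, key=lambda r: (r[0][0] - query[0]) ** 2
                                  + (r[0][1] - query[1]) ** 2)
votes = Counter(label for _, label in ranked[:3])
print(votes.most_common(1)[0][0])  # → Underweight
```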
¡ SVM is a supervised ML algorithm that can be used for both
classification and regression, but is mostly used in classification
problems.
¡ The algorithm plots each data item as a point in n-
dimensional space, with the value of each feature being
the value of a particular coordinate.
§ Classification is done by finding the hyper-plane that best differentiates
the two classes.
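Training an SVM is beyond a short sketch, but the decision rule of an already-trained linear SVM is just the sign of w·x + b, i.e., which side of the separating hyperplane a point lies on. The weights and bias below are made-up values for illustration, not a fitted model:

```python
# Linear SVM decision rule: sign(w·x + b) gives the predicted class.
# w and b here are illustrative placeholders, not learned parameters.
def svm_predict(x, w=(1.0, -1.0), b=0.0):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1

print(svm_predict((3, 1)))  # → 1   (on the positive side of x1 - x2 = 0)
print(svm_predict((1, 3)))  # → -1  (on the negative side)
```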
¡ Naive Bayes is a classification technique based on Bayes' theorem with an
assumption of independence among predictors.
¡ The Naive Bayes model is easy to build and particularly useful for very
large data sets. Along with simplicity, Naive Bayes is known
to outperform even highly sophisticated classification methods.

$$P(c|x) = \frac{P(x|c) \cdot P(c)}{P(x)}$$

¡ Given
§ P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
§ P(c) is the prior probability of the class.
§ P(x|c) is the likelihood, which is the probability of the predictor given the class.
§ P(x) is the prior probability of the predictor.
¡ Given the following training data

Weather    Play
Sunny      No
Overcast   Yes
Rainy      Yes
Sunny      Yes
Sunny      Yes
Overcast   Yes
Rainy      No
Rainy      No
Sunny      Yes
Rainy      Yes
Sunny      No
Overcast   Yes
Overcast   Yes
Rainy      No

Frequency table:

Weather       No           Yes          Probability
Sunny         2            3            =5/14 (0.36)
Overcast      0            4            =4/14 (0.29)
Rainy         3            2            =5/14 (0.36)
Total         5            9
Probability   =5/14 (0.36) =9/14 (0.64)

§ What is the probability that the players will play if the weather is sunny?

$$P(Yes|Sunny) = \frac{P(Sunny|Yes) \cdot P(Yes)}{P(Sunny)} = \frac{\frac{3}{9} \cdot \frac{9}{14}}{\frac{5}{14}} = \frac{3/14}{5/14} = 0.6$$

§ What is the probability that the players will play if the weather is rainy?
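The computation can be reproduced from the raw counts; the 14 weather/play rows are taken from the training table above.

```python
# Bayes' theorem on the weather/play data:
#   P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
data = [("Sunny", "No"),  ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Sunny", "Yes"), ("Sunny", "Yes"),    ("Overcast", "Yes"),
        ("Rainy", "No"),  ("Rainy", "No"),     ("Sunny", "Yes"),
        ("Rainy", "Yes"), ("Sunny", "No"),     ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

n = len(data)
p_yes = sum(1 for _, play in data if play == "Yes") / n    # 9/14
p_sunny = sum(1 for w, _ in data if w == "Sunny") / n      # 5/14
p_sunny_given_yes = (sum(1 for w, play in data if w == "Sunny" and play == "Yes")
                     / sum(1 for _, play in data if play == "Yes"))  # 3/9

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # → 0.6
```

The same counting answers the rainy-weather question on this slide.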